Each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, multiple attention heads allow the model to do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans; for example, some attention heads attend mostly to the next word, while others attend mainly from verbs to their direct objects. The computations for each attention head can be performed in parallel.
{\displaystyle {\begin{aligned}{\text{given input vectors }}&h_{0},h_{1},\dots \\{\text{combine them into a matrix }}H&={\begin{bmatrix}h_{0}\\h_{1}\\\vdots \end{bmatrix}}\\{\text{EncoderLayer}}(H)&={\begin{bmatrix}{\text{FFN}}({\text{MultiheadedAttention}}(H,H,H)_{0})\\{\text{FFN}}({\text{MultiheadedAttention}}(H,H,H)_{1})\\\vdots \end{bmatrix}}\\\end{aligned}}}
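As an illustration, the following is a minimal NumPy sketch of one encoder layer in this form (a single attention head, with residual connections and LayerNorm omitted; the function and variable names are ours, not from the paper):

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    # One attention head over the rows of H (each row is one token vector).
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward network, applied to each row independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(H, p):
    # EncoderLayer(H) = FFN(Attention(H, H, H)), applied row by row.
    A = self_attention(H, p["Wq"], p["Wk"], p["Wv"])
    return ffn(A, p["W1"], p["b1"], p["W2"], p["b2"])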
{\displaystyle {\text{RoPE}}{\big (}x_{m}^{(1)},x_{m}^{(2)},m{\big )}={\begin{pmatrix}\cos m\theta &-\sin m\theta \\\sin m\theta &\cos m\theta \end{pmatrix}}{\begin{pmatrix}x_{m}^{(1)}\\x_{m}^{(2)}\\\end{pmatrix}}={\begin{pmatrix}x_{m}^{(1)}\cos m\theta -x_{m}^{(2)}\sin m\theta \\x_{m}^{(2)}\cos m\theta +x_{m}^{(1)}\sin m\theta \\\end{pmatrix}}}
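A small NumPy sketch of this rotation, applied to every consecutive pair of coordinates of a vector (the per-pair angles here follow the common RoFormer convention with base 10000, which is an assumption; the function name is ours):

import numpy as np

def rope(x, m, base=10000.0):
    # Rotate each coordinate pair (x[2i], x[2i+1]) of an even-dimensional vector
    # by the angle m * base**(-2i/d), i.e. the 2x2 rotation shown above.
    d = x.shape[-1]
    angles = m * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The dot product depends only on the relative position (3-7 vs 13-17):
rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
print(np.allclose(rope(x, 3) @ rope(y, 7), rope(x, 13) @ rope(y, 17)))  # True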
(Luong et al, 2015) compared the relative performance of global attention (that of Bahdanau et al, 2014) and local (sliding-window) attention architectures for machine translation, finding that a mixed attention architecture produced higher quality than global attention, while a local attention architecture reduced translation time.

LSTM became the standard architecture for long-sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs: RNNs operate one token at a time from first to last and cannot operate in parallel over all tokens in a sequence.
Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; Bosma, Maarten; Mishra, Gaurav; Roberts, Adam; Barham, Paul; Chung, Hyung Won; Sutton, Charles; Gehrmann, Sebastian; Schuh, Parker; Shi, Kensen; Tsvyashchenko, Sasha; Maynez, Joshua; Rao, Abhishek (2022-04-01). "PaLM: Scaling Language Modeling with Pathways".
ALiBi allows pretraining on short context windows, then finetuning on longer context windows. Since it is directly plugged into the attention mechanism, it can be combined with any positional encoder that is plugged into the "bottom" of the entire network (which is where the sinusoidal encoder of the original transformer, as well as RoPE and many others, are located).
A "decoder-only" Transformer is not literally decoder-only, since without an encoder, the cross-attention mechanism has nothing to attend to. Thus, the decoder layers in a decoder-only Transformer consist of just two sublayers: the causally masked self-attention and the feedforward network.
Transformer layers carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
The fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network.
Transformers are used in large language models for autoregressive sequence generation: generating a stream of text, one token at a time. However, in most settings, decoding from language models is memory-bound, meaning that spare compute capacity is available. Speculative decoding uses this spare compute capacity by computing several tokens in parallel.
As the Transformer architecture natively processes numerical data, not text, there must be a translation between text and tokens. A token is an integer that represents a character, or a short segment of characters. On the input side, the input text is parsed into a token sequence; similarly, on the output side, the output tokens are parsed back to text.
In a prefixLM task, the sequence is divided into two parts. The first part is presented as context, and the model predicts the first token of the second part. That token is then revealed, and the model predicts the second token, and so on. The loss function for the task is still typically the same.
The plain transformer architecture had difficulty converging. In the original paper the authors recommended using learning rate warmup: the learning rate should linearly scale up from 0 to its maximal value over the first part of training (usually recommended to be 2% of the total number of training steps).
12382:
Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame,
to an image. Parti is an encoder-decoder Transformer, where the encoder processes a text prompt, and the decoder generates a token representation of an image. Muse is an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens. During generation, all input
The original Transformer paper reported using a learned positional encoding, but found it not superior to the sinusoidal one. Later work found that causal masking itself provides enough signal for a Transformer decoder to learn to implicitly perform absolute positional encoding without the positional encoding module.
Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow.
The purpose of each encoder layer is to create contextualized representations of the tokens, where each representation corresponds to a token that "mixes" information from other input tokens via the self-attention mechanism. Each decoder layer contains two attention sublayers: (1) cross-attention for
where the first columns correspond to the "prefix", and the subsequent columns correspond to the autoregressively generated text based on the prefix. They resemble encoder-decoder models, but have less "sparsity". Such models are rarely used, though they are cited as theoretical possibilities and benchmarked.
Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence-length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256, multi-query attention (MQA), and grouped-query attention (GQA).
fixed-size output vector, which was then processed by another recurrent network into an output. If the input is long, then the output vector would not be able to contain all relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.
6590:
Each decoder consists of three major components: a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant
Each encoder layer consists of two major components: a self-attention mechanism and a feed-forward layer. It takes a sequence of input vectors, applies the self-attention mechanism to produce an intermediate sequence of vectors, then applies the feed-forward layer to each vector individually.
Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
11826:
tokens are masked, and the highest-confidence predictions are included for the next iteration, until all tokens are predicted. Phenaki is a text-to-video model. It is a bidirectional masked transformer conditioned on pre-computed text tokens. The generated tokens are then decoded to a video.
The original 2017 Transformer used the post-LN convention. It was difficult to train and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, developed in 2020, was found to be easier to train, requiring no warm-up.
The encoder layers are stacked. The first encoder layer takes the sequence of input vectors from the embedding layer, producing a sequence of vectors. This sequence of vectors is processed by the second encoder layer, and so on. The output from the final encoder layer is then used by the decoder.
{\displaystyle M_{\text{causal}}={\begin{bmatrix}0&-\infty &-\infty &\dots &-\infty \\0&0&-\infty &\dots &-\infty \\0&0&0&\dots &-\infty \\\vdots &\vdots &\vdots &\ddots &\vdots \\0&0&0&\dots &0\end{bmatrix}}}
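A short NumPy sketch that builds this mask for a given sequence length (illustrative only; the helper name is ours):

import numpy as np

def causal_mask(n):
    # 0 on and below the diagonal, -inf above it, so each token can attend
    # only to itself and to earlier tokens.
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]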
The last decoder is followed by a final un-embedding layer to produce the output probabilities over the vocabulary. Then, one of the tokens is sampled according to the probabilities, and the decoder can be run again to produce the next token, and so on, autoregressively generating the output text.
Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Belanger, David; Colwell, Lucy; Weller, Adrian (2020-09-30). "Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers".
In an autoregressive task, the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed and the model predicts the second token, and so on. The loss function for the task is still typically the same.

The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which took ten years to develop. In the same year, self-attention
{\displaystyle {\text{Attention}}(q,K,V)={\text{softmax}}\left({\frac {qK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\approx {\frac {\varphi (q)^{T}\sum _{i}e^{\|k_{i}\|^{2}/2\sigma ^{2}}\varphi (k_{i})v_{i}^{T}}{\varphi (q)^{T}\sum _{i}e^{\|k_{i}\|^{2}/2\sigma ^{2}}\varphi (k_{i})}}}
Multimodal models can either be trained from scratch, or by finetuning. A 2022 study found that Transformers pretrained only on natural language can be finetuned on only 0.03% of parameters and become competitive with LSTMs on a variety of logical and visual tasks, demonstrating transfer learning.
9735:
5330:
Jaegle, Andrew; Borgeaud, Sebastian; Alayrac, Jean-Baptiste; Doersch, Carl; Ionescu, Catalin; Ding, David; Koppula, Skanda; Zoran, Daniel; Brock, Andrew; Shelhamer, Evan; Hénaff, Olivier (2021-08-02). "Perceiver IO: A General Architecture for Structured Inputs & Outputs".
There are also mixed seq2seq models. For example, in 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a Transformer-encoder–RNN-decoder model, on the argument that an RNN-decoder runs much faster than a Transformer-decoder when run autoregressively.
4935:
1590:
(2021), Parti (2022), Phenaki (2023), and Muse (2023). Unlike later models, DALL-E is not a diffusion model. Instead, it uses a decoder-only Transformer that autoregressively generates a text, followed by the token representation of an image, which is then converted by a variational autoencoder to an image.

This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a
(LayerNorm, or LN), which, while conceptually unnecessary, are needed for numerical stability and convergence. Similarly to how the feedforward network modules are applied individually to each vector, the LayerNorm is also applied individually to each vector.
Already in spring 2017, even before the "Attention Is All You Need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles. The Transformer architecture is now used in many
If a transformer is used with a fixed ("baked-in") prompt, then the key and value vectors can be computed for the prompt and saved on disk. The saving in compute is significant when the model is used for many short interactions, such as in online chatbots.
architecture. The encoder consists of encoding layers that process all the input tokens together, one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output and the decoder's output tokens so far.
1182:
word of the source text was processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a
An improved version, FlashAttention-2, was developed to cater to the rising demand for language models capable of handling longer context lengths. It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs.
2484:
1509:
In general, there are 3 classes of language modelling tasks: "masked", "autoregressive", and "prefixLM". These classes are independent of a specific modeling architecture such as Transformer, but they are often discussed in the context of Transformer.
1283:, by removing its recurrence to process all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance. Its parallelizability was an important factor to its widespread use in large neural networks.
Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention-free transformers reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.

{\displaystyle B={\begin{pmatrix}0&1&2&3&\cdots \\-1&0&1&2&\cdots \\-2&-1&0&1&\cdots \\-3&-2&-1&0&\cdots \\\vdots &\vdots &\vdots &\vdots &\ddots \\\end{pmatrix}}}
8405:
9895:
In speculative decoding, a smaller model or some other simple heuristic is used to generate a few speculative tokens that are subsequently verified by the larger model. For example, suppose a small model generated four speculative tokens:
5783:
1155:
The idea of encoder-decoder sequence transduction had been developed in the early 2010s (see for previous papers). The papers most commonly cited as the originators that produced seq2seq are two concurrently published papers from 2014.
1323:(2018), an encoder-only Transformer model. In 2019 October, Google started using BERT to process search queries. In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a Transformer-encoder–RNN-decoder model.
6540:
11530:
8599:
9372:
9045:
12275:
Parisotto, Emilio; Song, Francis; Rae, Jack; Pascanu, Razvan; Gulcehre, Caglar; Jayakumar, Siddhant; Jaderberg, Max; Kaufman, Raphaël Lopez; Clark, Aidan; Noury, Seb; Botvinick, Matthew; Heess, Nicolas; Hadsell, Raia (2020-11-21).
13273:
Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition".
4618:
9996:
3184:
incorporating the output of encoder (contextualized input token representations), and (2) self-attention for "mixing" information among the input tokens to the decoder (i.e. the tokens generated so far during inference time).
For non-greedy decoding, similar ideas apply, except the speculative tokens are accepted or rejected stochastically, in a way that guarantees the final output distribution is the same as if speculative decoding were not used.
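A sketch of the greedy acceptance scheme described above (not the stochastic variant), assuming hypothetical draft_model and target_model callables: the draft model returns the greedy next token for a sequence, and the target model returns its greedy next-token prediction after every prefix of the sequence.

def speculative_decode_step(prefix, draft_model, target_model, k=4):
    # 1. The small draft model proposes k tokens autoregressively.
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(prefix):]

    # 2. A single run of the large model scores every proposed position at once.
    #    verified[i] is the large model's greedy next token given draft[:i+1].
    verified = target_model(draft)

    # 3. Keep proposals while they match; on the first mismatch, substitute the
    #    large model's own token. If all match, one extra token is still gained.
    out = list(prefix)
    for i, tok in enumerate(proposed):
        if tok == verified[len(prefix) + i - 1]:
            out.append(tok)
        else:
            out.append(verified[len(prefix) + i - 1])
            break
    else:
        out.append(verified[len(draft) - 1])
    return out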
Specifically, consider a transformer model like GPT-3 with a context window size of 512. To generate an entire context window autoregressively with greedy decoding, it must be run 512 times, each time generating a token.
13426:
Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer".
9582:
14902:
Chang, Huiwen; Zhang, Han; Barber, Jarred; Maschinot, A. J.; Lezama, Jose; Jiang, Lu; Yang, Ming-Hsuan; Murphy, Kevin; Freeman, William T. (2023-01-02). "Muse: Text-To-Image Generation via Masked Generative Transformers".
2785:
1486:
judging the pragmatic acceptability of natural language. For example, the following sentence might be judged "not acceptable", because even though it is syntactically well-formed, it is improbable in ordinary human usage:
5171:
6606:
In contrast, the cross-attention mechanism attends to the output vectors of the encoder, which is computed before the decoder starts decoding. Consequently, there is no need for masking in the cross-attention mechanism.
4846:
2291:
1524:
Seq2seq models with attention (including self-attention) still suffered from the same issue as recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016,
14483:
Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (2020-11-08). "Long Range Arena: A Benchmark for Efficient Transformers".
An "encoder-decoder" Transformer is generally the same as the original Transformer, with 2 sublayers per encoder layer and 3 sublayers per decoder layer, etc. They might have minor architectural improvements, such as alternative activation functions or changing the location of normalization.
6809:
Each encoder layer contains 2 sublayers: the self-attention and the feedforward network. Each decoder layer contains 3 sublayers: the causally masked self-attention, the cross-attention, and the feedforward network.
Benchmarks revealed FlashAttention-2 to be up to 2x faster than FlashAttention and up to 9x faster than a standard attention implementation in PyTorch. Future developments include optimization for newer hardware such as H100 GPUs and new data types such as FP8.
5569:
3385:
8310:
1494:
Note that while each of these tasks is trivial or obvious for human native speakers of the language (or languages), they have typically proved challenging for previous generations of machine learning architecture.
7409:
10811:
5044:
2005:
An un-embedding layer is almost the reverse of an embedding layer. Whereas an embedding layer converts a token into a vector, an un-embedding layer converts a vector into a probability distribution over tokens.
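A minimal sketch of such a layer (the names are ours; W and b are the un-embedding weight matrix and bias, as in the UnEmbed(x) = softmax(xW + b) formula elsewhere in the article):

import numpy as np

def unembed(x, W, b):
    # Convert a final hidden vector x into a probability distribution over tokens.
    logits = x @ W + b
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()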
1011:
within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished.
14241:
Ainslie, Joshua; Lee-Thorp, James; de Jong, Michiel; Zemlyanskiy, Yury; Lebrón, Federico; Sanghai, Sumit (2023-12-23). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".
4991:
A standard Transformer architecture, showing on the left an encoder, and on the right a decoder. Note: it uses the pre-LN convention, which is different from the post-LN convention used in the original 2017 Transformer.
13363:
Xiong, Ruibin; Yang, Yunchang; He, Di; Zheng, Kai; Zheng, Shuxin; Xing, Chen; Zhang, Huishuai; Lan, Yanyan; Wang, Liwei; Liu, Tie-Yan (2020-06-29). "On Layer Normalization in the Transformer Architecture".
5661:
7111:
6963:
2917:
2104:
13908:
Nguyen, Toan Q.; Salazar, Julian (2019-11-02). Niehues, Jan; Cattoni, Rolando; Stüker, Sebastian; Negri, Matteo; Turchi, Marco; Ha, Thanh-Le; Salesky, Elizabeth; Sanabria, Ramon; Barrault, Loic (eds.).
When an autoregressive transformer is used for inference, such as when generating text, the query vector is different at each step, but the already-computed key and value vectors are always the same.
2956:
10597:
3995:
2150:
6603:
text generation. For decoding, all-to-all attention is inappropriate, because a token cannot attend to tokens not yet generated. Thus, the self-attention module in the decoder is causally masked.
2674:
2551:
2320:
9050:
8604:
6618:
6188:
5788:
4623:
2950:
is the distance one wishes to shift. This allows the transformer to take any encoded position, and find the encoding of the position n-steps-ahead or n-steps-behind, by a matrix multiplication.
8397:
8229:
3770:
5131:
3467:
5418:
3191:
for additional processing of their outputs and contain residual connections and layer normalization steps. These feed-forward layers contain most of the parameters in a Transformer model.
{\displaystyle {\begin{aligned}H'&={\text{MaskedMultiheadedAttention}}(H,H,H)\\{\text{DecoderLayer}}(H)&={\text{FFN}}({\text{MultiheadedAttention}}(H',H^{E},H^{E}))\end{aligned}}}
9237:
3620:
11525:
6134:
2948:
9843:
3899:
3813:
In a masked task, one or more of the tokens is masked out, and the model produces a probability distribution predicting what the masked-out tokens are, based on the context.

(1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an
12358:
Ruoss, Anian; Delétang, Grégoire; Medapati, Sourabh; Grau-Moya, Jordi; Wenliang, Li; Catt, Elliot; Reid, John; Genewein, Tim (2024-02-07). "Grandmaster-Level Chess Without Search".
7015:
3856:
3647:
1958:
1733:
14863:
Villegas, Ruben; Babaeizadeh, Mohammad; Kindermans, Pieter-Jan; Moraldo, Hernan; Zhang, Han; Saffar, Mohammad Taghi; Castro, Santiago; Kunze, Julius; Erhan, Dumitru (2022-09-29).
adapt the transformer to computer vision by breaking down input images into a series of patches, turning them into vectors, and treating them like tokens in a standard transformer.
7359:
is encoder-only. They are less often used currently, as they were found to be not significantly better than training an encoder-decoder Transformer, then taking just the encoder.
3685:
By convention, we write all vectors as row vectors. This means, for example, that pushing a vector through a linear layer means multiplying it by a weight matrix on the right, as xW.
11471:
9285:
6474:
11761:
10694:
3955:
10167:
10131:
10068:
10032:
4811:
4751:
4375:
4295:
11716:
10649:
5601:
5495:
4841:
4157:
An "encoder-only" Transformer applies the encoder to map an input text into a sequence of vectors that represent the input text. This is usually used for text embedding and representation learning.
4781:
3255:
3228:
1996:
1129:
A positional encoding is a fixed-size vector representation of the relative positions of tokens within a sequence: it provides the transformer model with information about where the words are in the input sequence.
2204:
A "prefixLM" (prefix language model) is a decoder-only architecture, but with prefix masking, which is different from causal masking. Specifically, it has a mask of the form
6562:
6469:
3426:(BERT). It is typically larger than the embedding size. For example, in both GPT-2 series and BERT series, the intermediate size of a model is 4 times its embedding size:
9575:
8985:
16304:
9776:
in CPUs, future tokens are computed concurrently, by speculating on the value of previous tokens, and are later discarded if it turns out the speculation was incorrect.
14793:
Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention".
13538:
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
13295:
Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz (2022-11-19),
13100:
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
10324:
2826:
13019:
10367:
10286:
10227:
9028:
5758:
7763:
In words, it means that each token can pay attention to itself and every token before it, but not to any token after it. As an example of an uncommon use of a mask matrix, the
2620:
4425:
4028:
2678:
Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window.
Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision".
11107:{\displaystyle e^{\langle x,y\rangle /\sigma ^{2}}=\mathbb {E} \approx \langle e^{\|x\|^{2}/2\sigma ^{2}}\varphi (x),e^{\|y\|^{2}/2\sigma ^{2}}\varphi (y)\rangle }
10400:
6571:
As the encoder processes the entire input all at once, every token can attend to every other token (all-to-all attention), so there is no need for causal masking.
922:
12295:
Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision".
5715:
9845:. However, if we had some educated guess for the values of these tokens, we could verify all of them in parallel, in one run of the model, by checking that each
8333:
1677:
15057:
Ferrando, Javier; Sarti, Gabriele; Bisazza, Arianna; Costa-jussà, Marta R. (2024-05-01). "A Primer on the Inner Workings of Transformer-based Language Models".
14421:
Zhai, Shuangfei; Talbott, Walter; Srivastava, Nitish; Huang, Chen; Goh, Hanlin; Zhang, Ruixiang; Susskind, Josh (2021-09-21). "An Attention Free Transformer".
7743:
1864:
13567:
5503:
3268:
Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody; Gonzalez, Joey; Zhang, Hao; Stoica, Ion (2023-06-20).
demonstrate the ability of transformers to perform a wide variety of NLP-related subtasks and their related real-world or practical applications, including:

{\displaystyle {\text{RoPE}}{\big (}x,m{\big )}^{T}{\text{RoPE}}{\big (}y,n{\big )}={\text{RoPE}}{\big (}x,m+k{\big )}^{T}{\text{RoPE}}{\big (}y,n+k{\big )}}
8234:
14048:
Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng (2021-04-01). "RoFormer: Enhanced Transformer with Rotary Position Embedding".
879:
12763:
Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling".
14599:
12251:
Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021-06-24),
11808:, which is then treated like an image, i.e. broken down into a series of patches, turned into vectors and treated like tokens in a standard transformer.
10181:
Training transformer-based architectures can be expensive, especially for long inputs. Many methods have been developed to attempt to address the issue.
9899:
869:
{\displaystyle {\begin{aligned}{\text{MaskedAttention}}(Q,K,V)={\text{softmax}}\left(M+{\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\end{aligned}}}
15552:
14750:
14136:
13336:
Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (2024-03-05),
7132:
Array of probability distributions, with shape (decoder vocabulary size x length(decoder output sequence)) /* encoder */ z_e ← encoder.tokenizer(t_e)
14972:
Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023).
13040:
Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10),
12914:
9030:
represents no attention paid, the linear bias matrix increases attention paid in one direction and decreases attention paid in the other direction.
Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016-09-25). "A Decomposable Attention Model for Natural Language Inference".
{\displaystyle {\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\approx Q(K^{T}V/{\sqrt {d_{k}}})}
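The point of the right-hand side is the order of operations: K^T V is computed first, so the cost grows linearly rather than quadratically in the sequence length. A minimal sketch (the feature maps used by Performer-style methods are omitted; the function name is ours):

import numpy as np

def linearized_attention(Q, K, V):
    # K.T @ V is a (d_k, d_v) matrix whose size does not depend on the sequence
    # length, so the N x N score matrix is never formed.
    d_k = K.shape[-1]
    return Q @ (K.T @ V) / np.sqrt(d_k)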
{\displaystyle {\begin{aligned}{\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}+sB\right)V\end{aligned}}}
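A NumPy sketch of this ALiBi-style attention for self-attention over n tokens, assuming a scalar per-head slope s (the names are ours):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def alibi_attention(Q, K, V, s):
    # B[i, j] = j - i, matching the linear bias matrix shown above; scaled by
    # the per-head slope s, it softly favours nearby positions.
    n, d_k = Q.shape[0], K.shape[-1]
    idx = np.arange(n)
    B = idx[None, :] - idx[:, None]
    return softmax(Q @ K.T / np.sqrt(d_k) + s * B) @ V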
710:
14069:
Press, Ofir; Smith, Noah A.; Lewis, Mike (2021-08-01). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation".
13509:
Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28),
{\displaystyle {\text{MultiheadedAttention}}(Q,K,V)={\text{Concat}}_{i\in [n_{\text{heads}}]}\left({\text{Attention}}(XW_{i}^{Q},XW_{i}^{K},XW_{i}^{V})\right)W^{O}}

{\displaystyle {\begin{aligned}{\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}+B\right)V\end{aligned}}}
2227:
1268:
you need". That hypothesis was against conventional wisdom of the time, and even his father, a well-known computational linguist, was skeptical.
13385:
Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020-01-01).
12571:
12227:
Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (August 17, 2015). "Effective Approaches to Attention-based Neural Machine Translation".
{\displaystyle {\begin{aligned}{\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\end{aligned}}}
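In code, this formula is only a few lines; the following NumPy sketch is illustrative (variable names and dimensions are ours):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (len_q, len_k)
    return softmax(scores) @ V           # (len_q, d_v)

# Toy usage: 5 query tokens attending over 7 key/value tokens.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(7, 64))
V = rng.normal(size=(7, 32))
print(attention(Q, K, V).shape)          # (5, 32)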
Yu, Jiahui; Xu, Yuanzhong; Koh, Jing Yu; Luong, Thang; Baid, Gunjan; Wang, Zirui; Vasudevan, Vijay; Ku, Alexander; Yang, Yinfei (2022-06-21),
14285:
Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph; Zhang, Hao; Stoica, Ion (2023-10-23).
11775:
Transformers can also be used/adapted for modalities (input or output) beyond just text, usually by finding a way to "tokenize" the modality.
12892:
Wu, Yonghui; et al. (2016-09-01). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation".
12700:
Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (October 2014).
6837:
1960:
The token embedding vectors are added to their respective positional encoding vectors (see below), producing the sequence of input vectors.
Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020).
13170:
{\displaystyle {\text{MultiQueryAttention}}(Q,K,V)={\text{Concat}}_{i\in [n_{\text{heads}}]}\left({\text{Attention}}(XW_{i}^{Q},XW^{K},XW^{V})\right)W^{O}}
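A sketch of the difference from ordinary multi-head attention: every head keeps its own query projection, but the key and value projections are shared across heads (the names are ours):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_attention(X, Wq_heads, Wk, Wv, Wo):
    K = X @ Wk                                   # shared keys    (n, d_head)
    V = X @ Wv                                   # shared values  (n, d_head)
    heads = []
    for Wq in Wq_heads:                          # one W_i^Q per head
        Q = X @ Wq                               # (n, d_head)
        heads.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate, then project with W^O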
456:
4454:, which is useful for training due to computational matrix operation optimizations that quickly compute matrix operations. The matrices
13320:
12206:
Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (September 1, 2014). "Neural Machine Translation by Jointly Learning to Align and Translate".
5671:
It may be necessary to cut out attention links between some word-pairs. For example, the decoder, when decoding for the token position
5325:{\displaystyle {\text{MultiheadedAttention}}(Q,K,V)={\text{Concat}}_{i\in }({\text{Attention}}(XW_{i}^{Q},XW_{i}^{K},XW_{i}^{V}))W^{O}}
957:
760:
{\displaystyle \ell _{\text{seq, key}}=\ell _{\text{seq, value}},\;d_{\text{query}}=d_{\text{key}},\;d_{\text{value}}=d_{\text{head}}}
14641:
Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention".
12871:
Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (2015). "Effective Approaches to Attention-based Neural Machine Translation".
5606:
3595:
The module takes three sequences, a query sequence, a key sequence, and a value sequence. The query sequence is a sequence of length
Without positional encoding, the model would be unable to process the input sequence as more than a
15031:
14924:
Ramesh, Aditya; Pavlov, Mikhail; Goh, Gabriel; Gray, Scott; Voss, Chelsea; Radford, Alec; Chen, Mark; Sutskever, Ilya (2021-02-26),
{\displaystyle {\text{Loss}}=-\sum _{t\in {\text{masked tokens}}}\ln({\text{probability of }}t{\text{ conditional on its context}})}
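In code, this loss is simply the negative log-probability summed over the masked positions; a minimal sketch (the names are ours):

import numpy as np

def masked_lm_loss(log_probs, targets, masked_positions):
    # log_probs: (seq_len, vocab_size) array of log-probabilities from the model.
    # targets: true token id at every position; only masked positions contribute.
    return -sum(log_probs[i, targets[i]] for i in masked_positions)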
is another LSTM that converts the vector into a sequence of tokens. Similarly, (Cho et al, 2014) was a 130M-parameter model that used
10415:
The following description follows exactly the Transformer as described in the original paper. There are variants, described in a later section.
Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?",
1190:(Bahdanau et al, 2014) introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem (of
15545:
{\displaystyle M_{\text{prefixLM}}={\begin{bmatrix}\mathbf {0} &-\infty \\\mathbf {0} &M_{\text{causal}}\end{bmatrix}}}

(a) One encoder layer and one decoder layer. (b) Two encoder layers and two decoder layers. The sublayers are labelled as well.
1698:
output side, the output tokens are parsed back to text. The module doing the conversion between token sequences and texts is a
Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981.
The KV caching method saves the computed key and value vectors at each attention block, so that they are not recomputed at each new token.
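A minimal sketch of such a cache for one attention block (the class and method names are ours): at each decoding step only the new token's key and value are computed and appended, and the attention call reuses all earlier entries.

import numpy as np

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        # Store the key/value vectors of the newly generated token.
        self.keys.append(k)
        self.values.append(v)

    def kv(self):
        # Stack into (num_cached_tokens, d) matrices for the attention call.
        return np.stack(self.keys), np.stack(self.values)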
3960:
2109:
16335:
14218:"Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference"
9331:
of a GPU, and by careful management of the blocks it minimizes data copying between GPU caches (as data movement is slow).
Jerome A. Feldman, "Dynamic connections in neural networks," Biological Cybernetics, vol. 46, no. 1, pp. 27-39, Dec. 1982.
Chen, Charlie; Borgeaud, Sebastian; Irving, Geoffrey; Lespiau, Jean-Baptiste; Sifre, Laurent; Jumper, John (2023-02-02),
11797:
1393:(instead of after) multiheaded attention and feedforward layers stabilizes training, not requiring learning rate warmup.
1141:
network which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.
912:
13677:
8596:
positional encoder that is directly plugged into the attention mechanism. Specifically, the ALiBi attention mechanism is
3144:
1108:
leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.
15476:
12000:
7372:
1645:
Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
1601:
1327:
1293:
1104:(1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the
1070:
745:
720:
669:
5500:
As an example, in the smallest GPT-2 model, there are only self-attention mechanisms. It has the following dimensions:
FlashAttention is an algorithm that implements the transformer attention mechanism efficiently on a GPU. It performs
8338:
8156:
3717:
with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention
Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 Dec 2014). "Sequence to sequence learning with neural networks".
5073:
3429:
3140:. In the author's words, "we hypothesized it would allow the model to easily learn to attend by relative position."
1015:
Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures such as long short-term memory (LSTM).
16203:
13559:
9287:. This is contrasted with the original sinusoidal positional encoding, which is an "absolute positional encoding".
5355:
1238:
451:
89:
12490:
Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2
6587:
A decoder consists of an embedding layer, followed by multiple decoder layers, followed by an un-embedding layer.
5891:
A non-masked attention module can be thought of as a masked attention module where the mask has all entries zero.
3099:{\displaystyle \sum _{j}c_{j}f(t+\Delta t_{j})=\left(\sum _{j}c_{j}\,\mathrm {diag} (f(\Delta t_{j}))\right)f(t)}
1735:. When faced with tokens outside the vocabulary, typically a special token is used, written as "" for "unknown".
1213:
14090:
Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018). "Self-Attention with Relative Position Representations".
12383:
Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing".
10169:
is completely discarded. The process then repeats (starting from the 4th token) until all tokens are generated.
2831:
The main reason for using this positional encoding function is that using it, shifts are linear transformations:
The Transformer architecture, being modular, allows variations. Several common variations are described here.
6096:
5894:
For example, the following matrix is commonly used in decoder self-attention modules, called "causal masking":
3143:
In typical implementations, all operations are done over the real numbers, not the complex numbers, but since
2922:
1194:
output vector), allowing the model to process long-distance dependencies more easily. They called their model
15923:
15612:
12385:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
{\displaystyle (f(t)_{2k},f(t)_{2k+1})=(\sin(\theta ),\cos(\theta ))\quad \forall k\in \{0,1,\ldots ,d/2-1\}}
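A NumPy sketch of this encoding, assuming the conventional base N = 10000 so that r = N^{2/d} and the angle for pair k at position t is t / r^k (the function name is ours; d is assumed even):

import numpy as np

def positional_encoding(t, d, N=10000):
    # f(t)[2k] = sin(t / r**k), f(t)[2k+1] = cos(t / r**k), with r = N**(2/d).
    k = np.arange(d // 2)
    theta = t / (N ** (2 * k / d))
    f = np.empty(d)
    f[0::2] = np.sin(theta)
    f[1::2] = np.cos(theta)
    return f

# Each position in a sequence gets a distinct, fixed vector:
pe = np.stack([positional_encoding(t, d=16) for t in range(50)])
print(pe.shape)  # (50, 16)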
821:
523:
299:
12419:
6968:
1175:(GRU) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.
5141:, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the
10185:(2020) is a standard benchmark for comparing the behavior of transformer architectures over long inputs.
These early seq2seq models had no attention mechanism, and the state vector is accessible only after the
Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022).
11527:
first, then multiply it with the query. In essence, we have managed to obtain a more precise version of
8402:
The benefit of RoPE is that the dot-product between two vectors depends on their relative location only:
5046:. It is theoretically possible for all three to be different, but that is rarely the case in practice.
3907:
3262:
3188:
1412:
on a small task-specific dataset. The pretrain dataset is typically an unlabeled large corpus, such as
11783:. The LLaVA was a vision-language model composed of a language model (Vicuna-13B) and a vision model (
Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Russ R; Le, Quoc V (2019).
11921:
Beyond traditional NLP, the transformer architecture has had success in other applications, such as:
RoPE (rotary positional embedding) is best explained by considering a list of 2-dimensional vectors
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
13146:
7264:
z_d ← layer.layer_norm(z_d) z_d ← layer.masked_multiheaded_attention(z_d, z_d, z_d)
4759:
4450:
The attention calculation for all tokens can be expressed as one large matrix calculation using the
Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). "Reformer: The Efficient Transformer".
7122:
The following is the pseudocode for a standard pre-LN encoder-decoder Transformer, adapted from
6591:
information from the encodings generated by the encoders. This mechanism can also be called the
12702:"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation"
Embedding layer, which converts tokens and positions of the tokens into vector representations.
Gehring, Jonas; Auli, Michael; Grangier, David; Yarats, Denis; Dauphin, Yann N. (2017-07-17).
12706:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
10336:
10255:
10196:
9739:
This has a neutral effect on model quality and training speed, but increases inference speed.
9010:
7288:
z_d ← layer.layer_norm(z_d) z_d ← layer.multiheaded_attention(z_d, z_e, z_e)
See Reprint in Models of Neural Networks II, chapter 2, pages 95-119. Springer, Berlin, 1994.
4128:. The attention weights are divided by the square root of the dimension of the key vectors,
4003:
3511:
units. For each unit, the transformer model learns three weight matrices: the query weights
Transformer Language Models without Positional Encodings Still Learn Positional Information
6535:{\displaystyle {\text{EncoderLayer}}(H)={\text{FFN}}({\text{MultiheadedAttention}}(H,H,H))}
The normalization used in the Transformer can be different from LayerNorm. One example is
2953:
By taking a linear sum, any convolution can also be implemented as linear transformations:
1271:
In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "
layer ← decoder.layers /* first sublayer */ z_d_copy ← copy(z_d)
z_d ← z_d + z_d_copy /* second sublayer */ z_d_copy ← copy(z_d)
5470:
It is theoretically possible for each attention head to have a different head dimension
3199:
1659:
14726:"Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | LMSYS Org"
z_d ← z_d + z_d_copy /* third sublayer */ z_d_copy ← copy(z_d)
Attention weights are calculated using the query and key vectors: the attention weight
14293:. SOSP '23. New York, NY, USA: Association for Computing Machinery. pp. 611–626.
14192:
13648:
Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (August 2019).
12957:
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
12488:
12465:
7554:
1807:
1198:, as it "emulates searching through a source sentence during decoding a translation".
1159:(Sutskever et al, 2014) was a 380M-parameter model for machine translation using two
The standard attention graph is either all-to-all or causal, both of which scale as
9367:
Multi-Query Attention changes the multiheaded attention mechanism. Whereas normally,
9328:
7174:
z_e ← layer.layer_norm(z_e) z_e ← layer.multiheaded_attention(z_e, z_e, z_e)
Since 2020, Transformers have been applied in modalities beyond text, including the
1041:
Transformers were first developed as an improvement over previous architectures for
Hendrycks, Dan; Gimpel, Kevin (2016-06-27). "Gaussian Error Linear Units (GELUs)".
13762:"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
13667:
13387:"Exploring the limits of transfer learning with a unified text-to-text transformer"
12960:
12803:
12793:
12719:
12586:
12530:
12461:
12420:"Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing"
12388:
12329:
12157:
12129:
9991:{\displaystyle {\tilde {x}}_{1},{\tilde {x}}_{2},{\tilde {x}}_{3},{\tilde {x}}_{4}}
that would be input into the positional encoding function. The original paper uses N = 10000.
14287:"Efficient Memory Management for Large Language Model Serving with PagedAttention"
14193:"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning"
12657:
Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020).
14865:"Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions"
14286:
12113:
11930:
9174:
7324:
z_d ← z_d + z_d_copy z_d ← decoder.final_layer_norm(z_d) output_distributions ←
This approximation can be computed in linear time, as we can compute the matrix
6176:
An encoder consists of an embedding layer, followed by multiple encoder layers.
1264:
recurrence is sufficient for language translation, thus the title "attention is
Proceedings of the 16th International Conference on Spoken Language Translation
12659:"Transformers are RNNs: Fast autoregressive Transformers with linear attention"
12317:
12133:
12080:
11915:
7222:
z_e ← encoder.final_layer_norm(z_e) /* decoder */ z_d ← decoder.tokenizer(t_d)
3137:
2780:{\displaystyle f(t)=\left(e^{it/r^{k}}\right)_{k=0,1,\ldots ,{\frac {d}{2}}-1}}
14592:"Constructing Transformers For Longer Sequences with Sparse Attention Methods"
14536:"The Reversible Residual Network: Backpropagation Without Storing Activations"
14135:
Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (2022-12-06).
13325:. Conference on Computer Vision and Pattern Recognition. pp. 11976–11986.
12959:. Austin, Texas: Association for Computational Linguistics. pp. 551–561.
12708:. Doha, Qatar: Association for Computational Linguistics. pp. 1724–1734.
12590:
3145:
complex multiplication can be implemented as real 2-by-2 matrix multiplication
1705:
The set of all tokens is the vocabulary of the tokenizer, and its size is the
1167:
is an LSTM that takes in a sequence of tokens and turns it into a vector. The
14137:"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (1987-07-29).
12473:
12341:
12333:
12141:
9756:
7162:
layer ← encoder.layers /* first sublayer */ z_e_copy ← copy(z_e)
5467:
is a final projection matrix owned by the whole multi-headed attention head.
2625:
The function is in a simpler form when written as a complex function of type
2218:
1514:
1461:
restoring or repairing incomplete or corrupted text. For example, the input,
12572:"Learning to control fast-weight memories: an alternative to recurrent nets"
11859:
10070:
are accepted. The same run of the large model already generated a new token
7186:
z_e ← z_e + z_e_copy /* second sublayer */ z_e_copy ← copy(z_e)
5717:. This may be accomplished before the softmax stage by adding a mask matrix
1615:
Note that "masked" as in "masked language modelling" is not "masked" as in "
1045:, but have found many applications since then. They are used in large-scale
16284:
16115:
15530:
15486:
15166:
15119:
15042:
Phuong, Mary; Hutter, Marcus (2022). "Formal Algorithms for Transformers".
15009:
14725:
13932:
13911:"Transformers without Tears: Improving the Normalization of Self-Attention"
13910:
13672:
12964:
12817:
12550:
9312:
4993:. If the attention head is used in a cross-attention fashion, then usually
2286:{\displaystyle f:\mathbb {R} \to \mathbb {R} ^{d};d\in \mathbb {Z} ,d>0}
1757:
1688:
1035:
14024:
Haviv, Adi; Ram, Ori; Press, Ofir; Izsak, Peter; Levy, Omer (2022-12-05),
13700:"XLNet: Generalized Autoregressive Pretraining for Language Understanding"
12723:
12149:
11114:
Consequently, the one-headed attention, with one query, can be written as
8592:
for the positional encoder on the original transformer. Instead, it is an
4843:. The attention mechanism requires the following three equalities to hold:
1096:
For many years, sequence modelling and generation was done by using plain
14974:"Precision information extraction for rare disease epidemiology at scale"
13974:
13947:
13761:
12534:
12318:"Learning to Throw With a Handful of Samples Using Decision Transformers"
12084:
11887:
11805:
11787:-L/14), connected by a linear layer. Only the linear layer is finetuned.
9356:
9336:
8987:. The idea being that the linear bias matrix is a softened mask. Just as
7539:
Transformers may use other positional encoding methods than sinusoidal.
7312:
z_d ← layer.layer_norm(z_d) z_d ← layer.feedforward(z_d)
4071:
3505:
2576:
is a free parameter that should be significantly larger than the biggest
1150:
1119:
which used neurons that multiply the outputs of other neurons, so-called
1031:
556:
50:
14684:
Lu, Kevin; Grover, Aditya; Abbeel, Pieter; Mordatch, Igor (2022-06-28).
14567:
Child, Rewon; Gray, Scott; Radford, Alec; Sutskever, Ilya (2019-04-23),
12519:"Learning, invariance, and generalization in high-order neural networks"
12038:
Some architectures, such as RWKV or state space models, avoid the issue.
8335:-dimensional vectors, a RoPE encoder is defined by a sequence of angles
5564:{\displaystyle d_{\text{emb}}=768,n_{\text{head}}=12,d_{\text{head}}=64}
The attention mechanism used in the Transformer architecture is scaled dot-product attention.
3380:{\displaystyle \mathrm {FFN} (x)=\phi (xW^{(1)}+b^{(1)})W^{(2)}+b^{(2)}}
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
14836:
14332:
13656:. Florence, Italy: Association for Computational Linguistics: 276–286.
13338:
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
13171:"The inside story of how ChatGPT was built from the people who made it"
11948:
chess board positions. Using static evaluation alone (that is, with no
9300:
8305:{\displaystyle {\text{RoPE}}{\big (}z_{m},m{\big )}=e^{im\theta }z_{m}}
8153:
Equivalently, if we write the 2-dimensional vectors as complex numbers
3167:
A Transformer is composed of stacked encoder layers and decoder layers.
1518:
705:
401:
327:
14749:
Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae (2023-12-15).
14534:
Gomez, Aidan N; Ren, Mengye; Urtasun, Raquel; Grosse, Roger B (2017).
6872:
convention. In the post-LN convention, the output of each sublayer is
4217:
are different matrices allows attention to be non-symmetric: if token
3203:
The feedforward network module. It is a two-layered network that maps
2160:
14422:"Towards 100x Speedup: Full Stack Transformer Inference Optimization"
10806:{\displaystyle \mathbb {E} =e^{-{\frac {\|x-y\|^{2}}{2\sigma ^{2}}}}}
9315:
that supplies transformer-based architectures and pretrained models.
9295:
The transformer model has been implemented in standard deep learning
9034:
original transformer, as well as RoPE and many others, are located).
6471:
stands for "feed-forward network". We can more succinctly write it as
3261:
The feedforward network (FFN) modules in a Transformer are 2-layered
2165:
1028:
864:
645:
14862:
14437:
Accelerating Large Language Model Decoding with Speculative Sampling
14011:
Proceedings of the 34th International Conference on Machine Learning
12701:
12680:(2021). "Linear Transformers Are Secretly Fast Weight Programmers".
12316:
Monastirsky, Maxim; Azulay, Osher; Sintov, Avishai (February 2023).
12282:
Proceedings of the 37th International Conference on Machine Learning
9772:
compute power by computing several tokens in parallel. Similarly to
5039:{\displaystyle X_{\text{query}}\neq X_{\text{key}}=X_{\text{value}}}
4397:
is the weighted sum of the value vectors of all tokens, weighted by
13560:"Sequence Modeling with Neural Networks (Part 2): Attention Models"
6784:
is the matrix with rows being the output vectors from the encoder.
4159:, which stabilizes gradients during training, and passed through a
1316:
14240:
12856:
12769:
12748:
12714:
12699:
12640:
Proceedings of the Annual Meeting of the Cognitive Science Society
12253:
Decision Transformer: Reinforcement Learning via Sequence Modeling
12212:
9042:
The vision transformer, in turn, stimulated new developments in convolutional neural networks. Each token is converted into a vector via looking up from a word embedding table.

In the attention formula, the softmax is applied over each of the rows of the matrix.

Relative Position Encodings are similar to ALiBi, but more generic: the bias matrix B can be any Toeplitz matrix, that is, B_{i,j} = B_{i',j'} whenever i − j = i' − j'.
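A minimal sketch of such a relative-position bias; the linear form B[i, j] = j − i matches ALiBi, while a general learned Toeplitz bias stores one value per offset (all names here are illustrative):

import numpy as np

def alibi_bias(n, s):
    # Linear bias s * B with B[i, j] = j - i, added to the attention logits.
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    return s * offsets

def toeplitz_bias(per_offset):
    # General relative-position bias: B[i, j] depends only on the offset i - j.
    n = (len(per_offset) + 1) // 2
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + (n - 1)  # map offset to index
    return per_offset[idx]

print(alibi_bias(4, s=0.5))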
14686:"Frozen Pretrained Transformers as Universal Computation Engines"
14661:
14291:
Proceedings of the 29th Symposium on Operating Systems Principles
13946:
Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (2022-06-06).
13294:
12075:
11970:
11949:
This computation extends similarly to multiple queries, and to multiheaded attention.

For higher-dimensional inputs, the RoPE encoding is applied to each pair of coordinates.
z_e ← layer.layer_norm(z_e)
z_e ← layer.feedforward(z_e)
1275:" paper. At the time, the focus of the research was on improving
If the attention head is used in a self-attention fashion, then {\displaystyle X_{\text{query}}=X_{\text{key}}=X_{\text{value}}}.

For example, a masked-token-prediction task replaces words in an input with mask tokens, as in "Thank you ~~ me to your party ~~ week", and trains the model to fill in the masked-out words.

The full positional encoding defined in the original paper is:
{\displaystyle (f(t))_{2k}=\sin(t/r^{k}),\quad (f(t))_{2k+1}=\cos(t/r^{k}),\quad r=N^{2/d}}
where N = 10000 and d is the dimension of the encoding.
z_e ← encoder.embedding(z_e) + encoder.positional_embedding(t)
z_d ← decoder.embedding(z_d) + decoder.positional_embedding(t)
In this example, {\displaystyle W^{O}\in \mathbb {R} ^{(64\times 12)\times 768}} is a square matrix.

Without positional information, for example, both "man bites dog" and "dog bites man" would be processed exactly the same way.

Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks.

Later variations of the LSTM have been widely adopted for training large language models (LLMs) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.
Sparse attention uses attention graphs that grow more slowly than O(N^2).

In the complex form, the positional encoding satisfies {\displaystyle f(t+\Delta t)=\mathrm {diag} (f(\Delta t))f(t)} for any offset Δt.

ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the earlier line of research from bag of words and word2vec.

The un-embedding layer computes {\displaystyle \mathrm {UnEmbed} (x)=\mathrm {softmax} (xW+b)}.
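A minimal sketch of that un-embedding step; the vocabulary size and initialization are illustrative:

import numpy as np

def unembed(x, W, b):
    # UnEmbed(x) = softmax(x W + b): map a d_emb vector to a distribution over the vocabulary.
    logits = x @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_emb, n_vocab = 768, 50257          # illustrative sizes only
W, b = rng.normal(size=(d_emb, n_vocab)) * 0.02, np.zeros(n_vocab)
p = unembed(rng.normal(size=d_emb), W, b)
assert np.isclose(p.sum(), 1.0)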
Random Feature Attention (2021) uses Fourier random features:
{\displaystyle \varphi (x)={\frac {1}{\sqrt {D}}}\left[\cos \langle w_{1},x\rangle ,\sin \langle w_{1},x\rangle ,\dots ,\cos \langle w_{D},x\rangle ,\sin \langle w_{D},x\rangle \right]^{\mathsf T}}
where w_1, ..., w_D are independent samples from the normal distribution N(0, σ²I). Performer (2022) uses the same Random Feature Attention, but the w_1, ..., w_D are first independently sampled from the normal distribution N(0, σ²I) and then Gram–Schmidt processed.
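A minimal sketch of such a random feature map; the dimensions and σ are illustrative:

import numpy as np

def random_fourier_features(x, W):
    # phi(x) = (1/sqrt(D)) [cos<w_1,x>, ..., cos<w_D,x>, sin<w_1,x>, ..., sin<w_D,x>]
    D = W.shape[0]
    proj = W @ x                                   # <w_i, x> for each of the D random vectors
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 64, 256, 1.0
W = rng.normal(scale=sigma, size=(D, d))           # w_i ~ N(0, sigma^2 I)
x, y = rng.normal(size=d), rng.normal(size=d)
# The inner product of the two feature maps approximates a Gaussian kernel in x - y.
approx = random_fourier_features(x, W) @ random_fourier_features(y, W)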
The matrices W_i^Q, W_i^K, W_i^V are "projection matrices" owned by the individual attention head i.

Exact dimension counts within a multiheaded attention module.

The loss function for the task is typically the sum of log-perplexities for the masked-out tokens, and the model is trained to minimize this loss function. The BERT series of models are trained for masked token prediction and another task.
Transformers have also been applied to computer vision, audio, multi-modal processing, robotics, and even playing chess.
A drafted token is accepted only if it is indeed the token with the largest log-likelihood in the larger model's output at that position.
Concretely, let the multiple attention heads be indexed by i. Then the multiheaded attention is
{\displaystyle \mathrm {MultiheadedAttention} (Q,K,V)=\mathrm {Concat} _{i}\left(\mathrm {Attention} (QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})\right)W^{O}}
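A minimal NumPy sketch of this multiheaded attention, using the example dimensions from earlier; the weight initialization is purely illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(X, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv have shape (n_head, d_emb, d_head); Wo has shape (n_head * d_head, d_emb).
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):          # one projection triple per head
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (seq, seq) attention weights
        heads.append(A @ V)                            # weighted sum of value vectors
    return np.concatenate(heads, axis=-1) @ Wo         # concatenate heads, project back

rng = np.random.default_rng(0)
d_emb, n_head, d_head, seq = 768, 12, 64, 16
X = rng.normal(size=(seq, d_emb))
Wq, Wk, Wv = (rng.normal(size=(n_head, d_emb, d_head)) * 0.02 for _ in range(3))
Wo = rng.normal(size=(n_head * d_head, d_emb)) * 0.02
out = multihead_attention(X, Wq, Wk, Wv, Wo)
assert out.shape == (seq, d_emb)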
Multimodal transformers are a variant of Transformers designed for multimodality.

In multiheaded self-attention, the matrix X is the concatenation of word embeddings, and the matrices W^Q, W^K, W^V are the query, key, and value projections. In the FFN formula, φ is its activation function; the original Transformer used ReLU.

The number of dimensions in an embedding vector is called the hidden size or embedding size, written d_emb.

Tasks for pretraining and fine-tuning commonly include language modeling, next-sentence prediction, question answering, reading comprehension, and sentiment analysis. BERT is short for "bidirectional encoder representations from transformers".

In speculative decoding, the draft tokens are run through the larger model, and only the ones it confirms are accepted.
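A minimal sketch of greedy speculative decoding under these assumptions; the draft_model and large_model callables, which map a token prefix to next-token log-probabilities, are hypothetical placeholders:

import numpy as np

def speculative_decode_step(prefix, draft_model, large_model, k=4):
    # Draft k tokens greedily with the small model, then keep the longest prefix
    # of them that the large model would also have chosen greedily.
    draft = list(prefix)
    for _ in range(k):
        draft.append(int(np.argmax(draft_model(draft))))   # cheap draft tokens

    accepted = list(prefix)
    for t in range(len(prefix), len(draft)):
        # One large-model evaluation per drafted position; in practice these run in parallel.
        best = int(np.argmax(large_model(draft[:t])))
        if best == draft[t]:
            accepted.append(best)          # draft token confirmed
        else:
            accepted.append(best)          # replace with the large model's choice
            break
    return accepted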
An improved version, FlashAttention-2, takes advantage of newer GPUs and new data types like FP8, offering roughly a 2x speed increase over the original FlashAttention.
The positional encoding is defined as a function of type f: ℝ → ℝ^d, where d is a positive even integer. Text is converted to numerical representations called tokens.

Exact dimension counts within an attention head module.

Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular, triggering a boom around large language models.
Each token is converted into an embedding vector via a lookup table; equivalently stated, it multiplies a one-hot representation of the token by an embedding matrix M.

It is usually the case that all of W^Q, W^K, W^V are square matrices, meaning d_emb,query = d_query and so on. The matrix of all query vectors is the query matrix: Q = X_query W^Q; similarly, the key matrix is K = X_key W^K and the value matrix is V = X_value W^V.

The un-embedding matrix has shape (d_emb, n_vocabulary).
The pre-LN convention requires no learning-rate warm-up, leading to faster convergence.
In masked attention, a mask matrix M is added to the attention logits before the softmax:
{\displaystyle \mathrm {MaskedAttention} (Q,K,V)=\mathrm {softmax} \left(M+{\frac {QK^{\mathsf T}}{\sqrt {d_{k}}}}\right)V}
The mask is −∞ at entries where the attention link must be cut, and 0 at other places. Causal masking is used for autoregressive modelling, where the attention for the token at position t should not have access to the token at position t + 1.
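A minimal sketch of such a causal mask; the size is illustrative:

import numpy as np

def causal_mask(n):
    # Mask matrix M: 0 where attention is allowed, -inf where the link must be cut.
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf   # position t may not attend to positions > t
    return M

# Added to the attention logits before the softmax:
# weights = softmax(Q @ K.T / sqrt(d_k) + causal_mask(n))
print(causal_mask(4))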
Transformers have been used for writing computer code based on requirements expressed in natural language.

Reformer (2020) reduces the computational load from O(N²) to O(N ln N) by using locality-sensitive hashing and reversible layers.

Block diagram for the full Transformer architecture.

Then we can represent the attention as
{\displaystyle \mathrm {Attention} (Q,K,V)=\mathrm {softmax} \left({\frac {QK^{\mathsf T}}{\sqrt {d_{k}}}}\right)V}
For each vector x_{i,query} in the query sequence, it is multiplied by a matrix W^Q to produce a query vector q_i = x_{i,query} W^Q.

The number of neurons in the middle layer of the FFN is called the intermediate size or feedforward size. The positional encoding can equivalently be written as a complex-valued function f: ℝ → ℂ^{d/2}, with θ = t / r^k and r = N^{2/d}.

"Masked" as in masked language modelling is not the same as "masked" as in masked attention, and "prefixLM" (prefix language modelling) is not the same as "prefixLM" (prefix language model).

The seq2seq architecture consists of two parts, an encoder and a decoder, each originally an LSTM.

An encoder layer can be written compactly as EncoderLayer(H), with the implicit convention that the FFN is applied to each row of the matrix individually. The output dimension of an attention head is its head dimension d_head.

All transformers have the same primary components: tokenizers, an embedding layer, transformer layers, and an un-embedding layer.

Image and video generators such as Stable Diffusion 3 (2024) are based on the Transformer architecture.
output_distributions.append(decoder.unembed(z_d))

Transformer encoder with norm-first and norm-last.
Transformer decoder with norm-first and norm-last.
Other activation functions were developed; later models used SwiGLU, while both GPT-1 and BERT used GELU. An alternative to LayerNorm is RMSNorm; other examples include ScaleNorm and FixNorm.

For image generation, notable transformer-based architectures include DALL-E 1 (2021) and Parti (2022).
In general, RoPE uses multiple angles θ^(1), ..., θ^(n), one per pair of coordinates.

Like earlier seq2seq models, the original transformer model used an encoder-decoder architecture. Transformers typically are first pretrained by self-supervised learning on a large generic dataset, followed by supervised fine-tuning on a small task-specific dataset.
ALiBi (Attention with Linear Biases) is not a replacement for the positional encoder of the original transformer; instead, it is an additional positional encoder that is directly plugged into the attention mechanism.

A triple of matrices (W^Q, W^K, W^V) is called an attention head. The number of dimensions in a query vector is the query size d_query, and similarly for the key size d_key and value size d_value. The original Transformer used d_ffn = 4 d_emb.

The decomposable attention model applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved state-of-the-art results in textual entailment.

With Multi-Query Attention, there is just one W^K and one W^V shared across all attention heads.
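A minimal sketch of Multi-Query Attention under these assumptions; the sizes and initialization are illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(X, Wq, Wk, Wv, Wo):
    # One query projection per head (Wq: (n_head, d_emb, d_head)),
    # but a single shared key/value projection (Wk, Wv: (d_emb, d_head)).
    K, V = X @ Wk, X @ Wv                       # shared across all heads
    heads = []
    for Wq_i in Wq:
        Q = X @ Wq_i
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
d_emb, n_head, d_head, seq = 256, 4, 64, 8
X = rng.normal(size=(seq, d_emb))
Wq = rng.normal(size=(n_head, d_emb, d_head)) * 0.02
Wk = rng.normal(size=(d_emb, d_head)) * 0.02
Wv = rng.normal(size=(d_emb, d_head)) * 0.02
Wo = rng.normal(size=(n_head * d_head, d_emb)) * 0.02
assert multi_query_attention(X, Wq, Wk, Wv, Wo).shape == (seq, d_emb)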
In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation.

Scaled dot-product attention, block diagram.
Encoder input t_e; decoder input t_d.

It is theoretically possible for each attention head to have a different head dimension d_head, but that is rarely the case in practice. Both the encoder and decoder layers have a feed-forward neural network for additional processing of their outputs, and contain residual connections and layer normalization steps. The full Transformer architecture can be described, in pseudocode, as an object hierarchy.

Further information: Attention (machine learning) § History

A well-cited early example of an RNN used for sequence modelling was the Elman network (1990). Among sparse attention methods, for example, BigBird (2020) uses random small-world networks which grow as O(N).

The query sequence has length ℓ_{seq, query}, and each entry is a vector of dimension d_{emb, query}; similarly for the key and value sequences.
The transformer is a deep learning architecture developed by researchers at Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need". At each layer, each token is contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT.

The transformer has had great success in natural language processing. Conformer and later Whisper follow the same pattern for speech recognition, first turning the speech signal into a spectrogram.

Linearized attention maintains the running sum Σ_i φ(k_i) v_i^T, which can be updated incrementally as new tokens arrive.

More general autoregressive orderings consider causal masks of the form {\displaystyle PM_{\text{causal}}P^{-1}}, where P is a permutation matrix.

With the post-LN convention, training requires a learning-rate warm-up, with the learning rate ramping up over the first part of training before decaying again.
Here N is the number of tokens in a sequence. FlashAttention performs the attention matrix multiplications in blocks, such that each block fits within the fast on-chip memory of a GPU.

The T5 transformer report documents a large number of natural-language pretraining tasks; some examples are translation between natural languages (machine translation) and question answering. The GPT series of models are trained by autoregressive tasks. Gated recurrent units (2014) further reduced the complexity of the LSTM.

Further information: Large language model § Evaluation

Q, K, V are defined as the matrices whose i-th rows are the vectors q_i, k_i, v_i respectively.

Multiheaded attention, block diagram.

For example, if the input token is 3, then its one-hot representation has a 1 in position 3 and 0 elsewhere, and its embedding vector Embed(3) is the corresponding row of the embedding matrix M.
For Random Feature Attention, σ = d_K^{1/4} is a common choice.

The final points of detail are the residual connections and layer normalization. The un-embedding layer is a linear-softmax layer. In the decoder, masked attention blocks backward information flow, which allows for autoregressive text generation.

Some commonly used tokenizers are byte pair encoding, WordPiece, and SentencePiece.
In ALiBi, the attention logits receive an additional bias s·B, where s is a real number ("scalar") and B is the matrix defined by B_{i,j} = j − i; the Toeplitz bias matrix of general Relative Position Encodings is otherwise unconstrained.

Language modeling asks for the probability of a token conditional on its context.

Alternative positional encodings
Alternative activation functions
Methods for stabilizing training
The T5 series of models are trained by prefixLM tasks, and encoder-style models are typically used to produce representations for downstream applications.

An example fine-tuning task is judging the grammatical acceptability of a sentence; "The course is jumping well." would be judged not acceptable.

Beyond language, transformers have been applied to biological sequence analysis, among other domains.

Scaled dot-product attention
Full transformer architecture
Alternative normalizations
Relative Position Encodings
Alternative attention graphs
Sub-quadratic transformers
Encoder-decoder (overview)
One encoder-decoder block.

Self-attention, originally called intra-attention or intra-sentence attention, was proposed for LSTMs, whose recurrent predecessors suffered from the vanishing-gradient problem.

The Long Range Arena (2020) is a standard benchmark for comparing the behavior of transformer architectures over long inputs.

In the pseudocode, the encoder and decoder loop over layer indices 1:length(encoder.layers) and 1:length(decoder.layers), with residual connections appearing as steps such as z_e ← z_e + z_e_copy.

Applications of transformers include machine translation, document summarization, and named entity recognition.

Parallelizing attention
Efficient implementation
Random Feature Attention
Multi-Query Attention
16485:
16484:
16480:
16479:
16478:
16476:
16475:
16474:
16465:Google software
16455:
16454:
16453:
16448:
16400:
16314:
16280:Google DeepMind
16258:
16224:Geoffrey Hinton
16183:
16120:
16046:Project Debater
15992:
15890:Implementations
15885:
15839:
15803:
15746:
15688:Backpropagation
15622:
15608:Tensor calculus
15562:
15559:
15529:
15524:
15496:
15449:
15432:
15390:Language models
15385:
15345:
15319:
15295:Neural networks
15279:
15240:
15213:
15184:
15129:
15125:Google DeepMind
15106:
15101:
15071:
15036:Wayback Machine
15023:
15021:Further reading
15018:
15017:
14970:
14966:
14946:
14942:
14922:
14918:
14900:
14893:
14881:
14879:
14870:
14869:
14861:
14854:
14845:
14843:
14835:
14834:
14830:
14812:
14808:
14791:
14787:
14770:
14766:
14747:
14743:
14734:
14732:
14724:
14723:
14719:
14682:
14678:
14660:
14656:
14639:
14635:
14618:
14614:
14605:
14603:
14590:
14589:
14585:
14565:
14561:
14532:
14528:
14519:
14517:
14504:
14503:
14499:
14481:
14477:
14460:
14453:
14433:
14429:
14418:
14414:
14394:
14387:
14371:
14370:
14364:
14362:
14352:
14348:
14340:
14338:
14331:
14330:
14326:
14319:
14283:
14279:
14261:
14257:
14239:
14235:
14226:
14224:
14216:
14215:
14211:
14202:
16413:Portals
16172:Auto-GPT
16004:Word2vec
15808:Hardware
15725:Datasets
15627:Concepts
15510:Category
15458:See also
15361:Chatbots
15268:AlphaDev
15148:Versions
15032:Archived
15010:36855134
14600:Archived
14514:Archived
14374:cite web
14222:TOGETHER
13810:Archived
13678:Archived
13629:keras.io
13603:Archived
13568:Archived
13203:Archived
13020:Archived
12818:33733157
12599:16683347
12570:(1992).
12551:20523475
12428:Archived
12183:Archived
11965:See also
11842:such as
11819:DALL-E 1
9755:applies
9577:, thus:
9299:such as
9274:′
9263:′
9224:′
9213:′
7419:prefixLM
7326:for each
7314:for each
7302:for each
7290:for each
7278:for each
7266:for each
7254:for each
7212:for each
7200:for each
7188:for each
7176:for each
7164:for each
6868:and the
6708:′
6626:′
6136:, where
5737:that is
5145:layers.
5139:parallel
4856:seq, key
4755:key size
4074:between
2293:, where
1414:The Pile
1373:Training
1361:(2021),
1317:word2vec
288:Boosting
137:Problems
16295:Meta AI
16132:AlphaGo
16116:PanGu-Σ
16086:ChatGPT
16061:Granite
16009:Seq2seq
15988:Whisper
15909:WaveNet
15904:AlexNet
15876:Flux.jl
15856:PyTorch
15708:Sigmoid
15703:Softmax
15568:General
15520:Commons
15374:Sparrow
15302:WaveNet
15226:AlphaGo
15196:Fan Hui
15155:AlphaGo
15141:AlphaGo
15001:9972634
12809:7861254
12158:1915014
12150:9377276
11971:seq2seq
11950:Minimax
11876:ChatGPT
11872:RoBERTa
11798:Whisper
11565:softmax
11152:softmax
9743:Caching
9305:PyTorch
9084:softmax
8750:is the
8638:softmax
7525:RMSNorm
7130:output:
6866:post-LN
6575:Decoder
6164:Encoder
5822:softmax
4657:softmax
4161:softmax
4070:is the
3997:, etc.
3418:(GPT),
3173:seq2seq
2315:integer
2011:softmax
1762:one-hot
1336:ChatGPT
1298:AI boom
1277:seq2seq
1262:without
1169:decoder
1165:encoder
1081:History
870:NeurIPS
687:(ECRAM)
641:AlexNet
283:Bagging
16310:Huawei
16290:OpenAI
16192:People
16162:MuZero
16024:Gemini
16019:Claude
15954:DALL-E
15866:Theano
15446:(2024)
15429:(2024)
15423:(2023)
15421:Gemini
15417:(2022)
15411:(2022)
15405:(2021)
15399:(2018)
15382:(2023)
15380:Gemini
15376:(2022)
15370:(2016)
15316:(2022)
15310:(2017)
15304:(2016)
15276:(2024)
15270:(2023)
15264:(2019)
15258:(2018)
15237:(2023)
15229:(2017)
15210:(2017)
15208:Ke Jie
15204:(2016)
15198:(2015)
15181:(2019)
15179:MuZero
15175:(2017)
15169:(2017)
15163:(2016)
15161:Master
15157:(2015)
15115:Google
15008:
14998:
14710:
14315:
13990:
13982:
13786:
13564:Indico
13411:
13076:
13014:
12925:
12824:
12816:
12806:
12792:: 40,
12597:
12549:
12541:
12500:
12472:
12399:
12340:
12179:OpenAI
12156:
12148:
12140:
11960:level.
11860:Claude
11427:where
10599:where
10229:where
10133:, and
9618:Concat
9408:Concat
9339:GPUs (
9153:where
8710:Here,
7464:causal
7338:return
7126:input:
6965:where
6870:pre-LN
6849:style.
6757:where
6449:where
6109:causal
5907:causal
5571:Since
5440:, and
5207:Concat
4588:, and
4337:(i.e.
4257:(i.e.
3387:where
2919:where
2787:where
2556:Here,
2486:where
2013:layer:
1475:week".
1391:before
1359:DALL-E
1032:corpus
1001:tokens
989:Google
663:Vision
519:RANSAC
397:OPTICS
392:DBSCAN
376:-means
183:AutoML
16376:Mamba
16147:SARSA
16111:LLaMA
16106:BLOOM
16091:GPT-J
16081:GPT-4
16076:GPT-3
16071:GPT-2
16066:GPT-1
16029:LaMDA
15861:Keras
15437:Other
15403:LaMDA
15324:Other
15249:Other
15059:arXiv
15044:arXiv
14954:arXiv
14930:arXiv
14905:arXiv
14817:arXiv
14795:arXiv
14774:arXiv
14665:arXiv
14643:arXiv
14622:arXiv
14573:arXiv
14548:arXiv
14486:arXiv
14464:arXiv
14441:arXiv
14402:arXiv
14295:arXiv
14266:arXiv
14244:arXiv
14149:arXiv
14117:arXiv
14092:arXiv
14071:arXiv
14050:arXiv
14030:arXiv
13988:S2CID
13960:arXiv
13919:arXiv
13886:arXiv
13853:arXiv
13832:arXiv
13774:arXiv
13737:arXiv
13712:arXiv
13658:arXiv
13540:arXiv
13515:arXiv
13429:arXiv
13399:arXiv
13366:arXiv
13342:arXiv
13301:arXiv
13276:arXiv
13251:arXiv
13102:arXiv
13046:arXiv
13008:Wired
12980:arXiv
12894:arXiv
12873:arXiv
12852:arXiv
12822:S2CID
12765:arXiv
12744:arXiv
12710:arXiv
12595:S2CID
12575:(PDF)
12494:(PDF)
12397:S2CID
12360:arXiv
12297:arXiv
12257:arXiv
12229:arXiv
12208:arXiv
12154:S2CID
12092:(PDF)
12014:Notes
11906:(NER)
11868:XLNet
11852:GPT-4
11848:GPT-3
11844:GPT-2