Paper Discovery Feed
Explore domain hubs and recent submissions
Language Models are Few-Shot Learners
We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
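A minimal sketch of the few-shot setup this entry describes: labeled examples are placed directly in the prompt and the model completes the final query with no gradient updates. GPT-2 via Hugging Face transformers stands in for the far larger models in the paper, and the translation prompt and examples are illustrative assumptions, not data from the source.

```python
# Few-shot prompting sketch: in-context examples plus a query, no fine-tuning.
# "gpt2" is a small stand-in checkpoint; the prompt format is illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Assumed in-context examples for an English-to-French task.
examples = [
    "sea otter => loutre de mer",
    "cheese => fromage",
]
query = "cat =>"

prompt = "Translate English to French:\n" + "\n".join(examples) + "\n" + query
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```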
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
We introduce BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
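A minimal sketch of the bidirectional conditioning described above: a masked token is predicted from both its left and right context at once. The bert-base-uncased checkpoint and the example sentence are assumptions chosen for illustration.

```python
# Masked-language-model sketch: [MASK] is filled using context on both sides.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Both "went to the" (left) and "to buy milk" (right) inform the prediction.
for candidate in fill_mask("She went to the [MASK] to buy milk."):
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
```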
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
We combine pre-trained parametric and non-parametric memory for language generation, using a dense passage retriever to condition seq2seq models on retrieved documents.
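A structural sketch of the retrieve-then-generate pattern this entry summarizes. A TF-IDF retriever stands in for the paper's dense passage retriever, a small public seq2seq checkpoint stands in for its fine-tuned generator, and the toy corpus and question are illustrative assumptions.

```python
# Retrieval-augmented generation sketch: retrieve a passage, then condition
# a seq2seq model on question + retrieved text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

passages = [  # toy non-parametric memory, illustrative only
    "The Eiffel Tower is located in Paris, France.",
    "The Great Wall of China stretches across northern China.",
]
question = "Where is the Eiffel Tower?"

# Retrieval step: score passages against the question and keep the best one.
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(passages)
scores = cosine_similarity(vectorizer.transform([question]), doc_vecs)[0]
best_passage = passages[scores.argmax()]

# Generation step: the parametric model conditions on the retrieved passage.
generator = pipeline("text2text-generation", model="google/flan-t5-small")
answer = generator(f"question: {question} context: {best_passage}")
print(answer[0]["generated_text"])
```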
Attention Is All You Need
We propose the Transformer, a model architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions. Experiments show these models to be superior in quality while being more parallelizable.
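A minimal sketch of the scaled dot-product attention at the core of such an architecture: each output is a weighted sum of value vectors, with weights computed from query-key similarity rather than from recurrence or convolution. The toy shapes and single attention head are simplifying assumptions.

```python
# Scaled dot-product attention sketch (single head, no masking or projections).
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (seq_len, d_k) arrays; returns (seq_len, d_k) attended values."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # weighted sum of values

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                        # 4 toy tokens, dim 8
print(scaled_dot_product_attention(x, x, x).shape)     # (4, 8) self-attention
```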