What do we do?

We're the Culture and Computation lab! We build computational methodologies to help people learn about culture through documents.

Our focus is divided into three core areas: digital humanities, machine learning, and computational social science. Here are some of our favorite papers, in no particular order…

The Chatbot and the Canon: Poetry Memorization in LLMs

By Lyra D'Souza and David Mimno in 2024.

To what degree do large language models (LLMs) memorize their training set? This question matters because LLMs are often trained on sensitive data. Can they reproduce entire stretches of their training data verbatim? We approach this issue by evaluating whether ChatGPT memorizes poems. Why poems? They’re short, widely read, and easily available online. We find commercially available models can (and do) memorize entire poems from their training set, raising the question of what else they might be memorizing.
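
One minimal way to operationalize such a check (a sketch of the idea, not the paper's actual protocol): prompt a model with a poem's opening lines and measure how much of the remainder it reproduces verbatim. The `generate` callable below is a placeholder for whatever completion API you use.

```python
from difflib import SequenceMatcher

def memorization_score(generate, poem_lines, prompt_lines=2):
    """Prompt a model with a poem's opening lines and measure how much of the
    rest it reproduces verbatim (0.0 = nothing, 1.0 = the entire remainder).

    `generate` is any callable mapping a prompt string to a completion string;
    plug in whichever chat or completion API you have access to."""
    prompt = "\n".join(poem_lines[:prompt_lines])
    reference = "\n".join(poem_lines[prompt_lines:])
    completion = generate(prompt)
    # Longest contiguous stretch of the reference reproduced verbatim.
    match = SequenceMatcher(None, completion, reference).find_longest_match(
        0, len(completion), 0, len(reference))
    return match.size / max(len(reference), 1)
```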

Small Worlds: Measuring the Mobility of Characters in English-Language Fiction

By Matthew Wilkens, Elizabeth Evans, Sandeep Soni, David Bamman, and Andrew Piper. Journal of Computational Literary Studies, 2024.

The representation of mobility in literary narratives has important implications for the cultural understanding of human movement and migration. In this paper, we introduce novel methods for measuring the physical mobility of literary characters through narrative space and time. We capture mobility through geographically defined space, as well as through generic locations such as homes, driveways, and forests. Using a dataset of over 13,000 books published in English since 1789, we observe significant 'small world' effects in fictional narratives. Specifically, we find that fictional characters cover far less distance than their nonfictional counterparts; the pathways covered by fictional characters are highly formulaic and limited from a global perspective; and fiction exhibits a distinctive semantic investment in domestic and private places. Surprisingly, we do not find that characters' ascribed gender has a statistically significant effect on distance traveled, but it does influence the semantics of domesticity.
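
As a rough illustration of the geographic side of such a measurement (our own sketch of the setup, not the paper's pipeline), one can geocode the places a character visits and sum great-circle distances between consecutive stops:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def character_mobility_km(places):
    """Total distance implied by an ordered list of geocoded (lat, lon) stops."""
    return sum(haversine_km(a, b) for a, b in zip(places, places[1:]))

# Hypothetical itinerary: London -> Dover -> Paris
print(character_mobility_km([(51.507, -0.128), (51.128, 1.313), (48.857, 2.352)]))
```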

Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings

By Andrea W Wen-Yi and David Mimno in 2024.

How do LLMs represent relationships between languages? Every language model has an input layer that maps tokens to vectors, and this ubiquitous layer is often overlooked. We find that similarities between these input embeddings are highly interpretable and that the geometry of these embeddings differs between model families. In one case (XLM-RoBERTa), embeddings encode language. Another family (mT5) represents cross-lingual semantic similarity: the 50 nearest neighbors for any token represent an average of 7.61 writing systems, and are frequently translations. This result is surprising given that there are no explicit parallel cross-lingual training corpora and no explicit incentive for translation in the pre-training objectives.
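
A small sketch of how such nearest-neighbor comparisons can be run with an off-the-shelf checkpoint (illustrative only; `xlm-roberta-base` and the query token are our choices here, not necessarily the paper's):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"                # any multilingual checkpoint
tok = AutoTokenizer.from_pretrained(name)
emb = AutoModel.from_pretrained(name).get_input_embeddings().weight.detach()
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit norm, so dot product = cosine

def neighbors(token, k=10):
    """Return the k most similar tokens to `token` in the input-embedding layer."""
    idx = tok.convert_tokens_to_ids(token)
    sims = emb @ emb[idx]                            # cosine similarity to every token
    top = torch.topk(sims, k + 1).indices.tolist()   # +1 so we can drop the token itself
    return [tok.convert_ids_to_tokens(i) for i in top if i != idx][:k]

print(neighbors("▁book"))
```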

T5 meets Tybalt: Author Attribution in Early Modern English Drama Using Large Language Models

By Rebecca M. M. Hicke and David Mimno in 2024.

This paper considers whether large language models (LLMs) can be used for stylometry, specifically authorship identification in Early Modern English drama. We’re interested in whether LLMs can perform well in attribution scenarios in which other methods struggle: with very short texts (5-450 words) and challenging texts (Early Modern drama). We find both promising and concerning results: LLMs are able to accurately predict the author of surprisingly short passages but are also prone to confidently misattributing texts to specific authors.
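
For intuition, the zero-shot version of this task looks roughly like the sketch below; the paper itself fine-tunes T5 rather than relying on a hand-written prompt, so treat this purely as an illustration of the setup.

```python
def attribution_prompt(passage, candidates):
    """Assemble a zero-shot authorship-attribution prompt (illustrative only)."""
    options = ", ".join(candidates)
    return (
        "Which of the following Early Modern playwrights wrote this passage?\n"
        f"Candidates: {options}\n\n"
        f"Passage:\n{passage}\n\nAuthor:"
    )

print(attribution_prompt(
    "But soft, what light through yonder window breaks?",
    ["William Shakespeare", "Christopher Marlowe", "Ben Jonson"],
))
```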

Endometriosis Online Communities: A Quantitative Analysis

By Federica Bologna, Rosamond Thalken, Kristen Pepin, and Matthew Wilkens in 2024.

We investigate patient needs and support strategies in endometriosis online health communities (OHCs) on Reddit. We identify associations between a post’s subject matter (“topics”), the people and relationships (“personas”) mentioned, and the type of support the post seeks (“intent”). Our results emphasize that members of these OHCs need greater empathy within clinical settings, easier access to appointments, more information on care pathways, and more support for their loved ones. Endometriosis OHCs currently fulfill some of these needs by providing members a space where they can receive validation, discuss care pathways, and learn to manage symptoms.
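
One common starting point for recovering “topics” from such posts is a bag-of-words topic model; the snippet below is a toy sketch with scikit-learn, not the study’s actual pipeline, and the example posts are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [  # invented stand-ins for Reddit posts
    "waiting months for a laparoscopy appointment with my specialist",
    "how do you manage pain flares while at work",
    "my partner does not understand what this diagnosis means for us",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Print the top words associated with each topic
terms = vec.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```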

The Afterlives of Shakespeare and Company in Online Social Readership

By Maria Antoniak, David Mimno, Rosamond Thalken, Melanie Walsh, Matthew Wilkens, and Gregory Yauney in 2024.

In this article, we explore the extent to which we can make comparisons between the Shakespeare and Company and Goodreads communities. By quantifying similarities and differences, we can identify patterns in how works have risen or fallen in popularity across these datasets. We can also measure differences in how works are received by comparing co-reading patterns. Finally, by examining the complete networks of co-readership, we can observe changes in the overall structures of literary reception.
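
Conceptually, a co-readership network connects every pair of books shared by a reader; the sketch below (toy data, not drawn from either dataset) shows the idea with networkx.

```python
import networkx as nx
from itertools import combinations

def coreading_graph(readers_to_books):
    """Books are nodes; an edge's weight counts how many readers read both."""
    g = nx.Graph()
    for books in readers_to_books.values():
        for a, b in combinations(sorted(set(books)), 2):
            weight = g.get_edge_data(a, b, default={}).get("weight", 0)
            g.add_edge(a, b, weight=weight + 1)
    return g

g = coreading_graph({
    "reader_1": ["Ulysses", "Mrs Dalloway", "The Waste Land"],
    "reader_2": ["Ulysses", "Mrs Dalloway"],
})
print(g["Ulysses"]["Mrs Dalloway"]["weight"])  # 2
```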

Data Similarity is Not Enough to Explain Language Model Performance

By Gregory Yauney, Emily Reif, and David Mimno in 2023.

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task whose data more closely resembles a model’s pretraining data is presumed to be easier for that model. We test whether distributional and example-specific similarity measures correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
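
As one concrete example of a distributional similarity measure (our illustration; the paper evaluates several such measures), Jensen-Shannon divergence between unigram distributions looks like this:

```python
from collections import Counter
from math import log2

def unigram_dist(texts):
    """Empirical unigram distribution over whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two unigram distributions (0 = identical)."""
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    def kl(a):
        return sum(pr * log2(pr / mix[w]) for w, pr in a.items() if pr > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

pretraining = unigram_dist(["the cat sat on the mat", "dogs chase cats"])
task = unigram_dist(["the dog sat on the rug"])
print(jensen_shannon(pretraining, task))
```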

Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement

By Rosamond Thalken, Edward H. Stiglitz, David Mimno, and Matthew Wilkens in 2023.

We consider whether large language models can perform a task that is challenging even for humans: the classification of legal reasoning. Using a dataset of U.S. Supreme Court opinions annotated by a team of domain experts, we systematically test the ability of a variety of LLMs to classify legal reasoning. We find that generative models perform poorly when given instructions (i.e., prompts) equivalent to those presented to human annotators through our codebook. Our findings generally sound a note of caution about using generative LMs on complex tasks without fine-tuning, and point to the continued relevance of classification methods built on intensive human annotation.
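
The prompting setup can be pictured as handing the model the same label definitions the annotators worked from; the snippet below is a simplified sketch with invented labels, not the paper’s codebook.

```python
CODEBOOK = {  # invented, heavily abridged labels
    "textualism": "reasoning from the plain meaning of the statutory or constitutional text",
    "precedent": "reasoning from prior judicial decisions",
    "purpose": "reasoning from the aims or intent behind the law",
}

def codebook_prompt(passage):
    """Ask for one label, using the same definitions the human annotators saw."""
    definitions = "\n".join(f"- {label}: {desc}" for label, desc in CODEBOOK.items())
    return (
        "Classify the legal reasoning in the passage using these definitions:\n"
        f"{definitions}\n\nPassage:\n{passage}\n\nLabel:"
    )
```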