DiLCo Methods Day 2022

Natural language processing for digital language

On 7 October 2022, DiLCo organised a "Methods Day" on computational and quantitative analysis of born-digital language with three public lectures:

Dong Nguyen/Anna Wegmann (Utrecht): Representing words and linguistic style for computational analysis: A gentle introduction
Adrien Barbaresi (BBAW): What do we mean when we talk about web corpora and how are they built?
Gregor Wiedemann (HBI): Similarity-based clustering and temporal visualisation of tweets for detecting disinformation narratives

All lectures have been recorded and are available open access in our DiLCo Video Repository.

For the abstracts see further below.

Abstracts

Representing Words and Linguistic Style for Computational Analysis: A Gentle Introduction

Dong Nguyen & Anna Wegmann

Universiteit Utrecht

Neural network approaches have radically changed the field of NLP by introducing a way to represent the meaning of words: word embeddings. Such word embeddings are increasingly used as research objects to study social and linguistic research questions. More recently, researchers have also looked at representing sentences in a meaningful, data-driven way, including their style.

This lecture will first introduce word embeddings: What are word embeddings? And how are they learned from data? We will then continue with an introduction to representing the linguistic style of sentences: style embeddings. We explain how they can be created and illustrate how they can be used in downstream tasks, e.g. by analyzing linguistic style accommodation in conversations.

What do we mean when we talk about web corpora and how are they built?

Adrien Barbaresi

Berlin-Brandenburgische Akademie der Wissenschaften, Berlin

Using texts from the web to observe language seems simple, but methodological issues are inevitable, so the data collection phase can sometimes become a project in itself.

After a brief history of web corpus linguistics, corpus building methods will be reviewed, from major data sources and their quirks to concrete steps focussing on the discovery and processing of web page contents, including the example of blogs and blog comments.
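As a rough illustration of the "processing of web page contents" step, the sketch below strips markup and obvious non-content elements from a page using only Python's standard library. It is not the method presented in the lecture: the `TextExtractor` class, the `extract_text` helper, the list of skipped tags, and the sample page are all hypothetical, and real corpus pipelines rely on far more robust boilerplate removal.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping a few non-content elements (assumed list)."""

    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text chunks of a page, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample page with a blog post and one comment.
sample_page = """<html><head><title>Blog</title><style>p {color: red}</style></head>
<body><nav>Home | About</nav><p>First post.</p><div class="comment">Nice post!</div>
<script>track()</script><footer>Imprint</footer></body></html>"""

text = extract_text(sample_page)
print(text)
```

Even in this toy case, the page title, the post, and the comment end up in one undifferentiated stream, which hints at why separating blog text from comments is a methodological question in its own right.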

Similarity-based clustering and temporal visualisation of tweets for detecting disinformation narratives

Gregor Wiedemann

Leibniz-Institut für Medienforschung Hans-Bredow-Institut, Hamburg

Social networks have become an important public arena in which discourse positions seek political hegemony. Especially for crisis-related issues such as the Covid-19 pandemic or the Russian war in Ukraine and its impact on other nations, certain discourse positions (mostly at the far right of the political spectrum) are massively backed by the spread of disinformation posts that get shared across platforms.

In this talk, I will demonstrate a method to detect and trace narrative patterns in Twitter posts concerning the conspiracy theory of US-financed labs for biological warfare in Ukraine. The method employs transformer-based sentence embeddings for the creation of tweet similarity networks that can reveal distinct events as well as the constant repetition of background knowledge for creating a disinformation narrative.
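The general idea of a similarity network can be sketched in a few lines of Python. This is only a minimal illustration, not the speaker's implementation: the toy three-dimensional vectors stand in for transformer-based sentence embeddings, the threshold of 0.9 is arbitrary, and grouping tweets by connected components is an assumed, simple clustering choice.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(embeddings, threshold):
    """Link every pair of tweets whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]


def connected_components(n, edges):
    """Group nodes 0..n-1 into clusters using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical embeddings: in practice each row would be produced by a
# sentence-embedding model applied to one tweet.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.0, 1.0, 0.0],
]

edges = similarity_edges(toy_embeddings, threshold=0.9)
clusters = connected_components(len(toy_embeddings), edges)
print(edges)
print(clusters)
```

With these toy vectors, tweets 0 and 1 form one cluster, tweets 2 and 3 another, and tweet 4 remains isolated. Attaching timestamps to each cluster would be one way to approach the temporal view mentioned in the lecture title.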

Collecting and Analyzing Tweets with Python

Fabian Barteld & Phillip Sandkühler

Universität Düsseldorf

This hands-on tutorial introduces how to collect and analyze tweets and their metadata using Python. No programming knowledge is required.

Please bring your own laptop to work through the exercises of the tutorial. It is not necessary to install any software in advance.

DiLCo (‘Digital language variation in context’) is a 3-year international research network initiated in 2021 at the University of Hamburg. The network brings together researchers from Europe and the USA with expertise in computational, interactional, and ethnographic approaches to digital language and linguistics. It aims to provide a platform for the development of interdisciplinary ideas in digital language and communication research, and for early-career capacity building.