Task and Data

Task

Given the messages a user has posted on a Twitch channel, the task is to predict whether or not the user is subscribed to that channel.

Data

We provide a dataset of publicly available Twitch comments, including metadata, that we obtained via Twitch’s official API. The data includes information about whether the commenting user is subscribed to the channel they are commenting on. Given unseen users and a set of channels, your task is to predict whether each user is subscribed to the channel or not.

For more detailed information on the datasets and a brief analysis, please take a look at the overview paper.

Training Dataset

The complete dataset consists of more than 400 000 000 public Twitch comments from English channels across one month (January 2020), along with some metadata.

> DOWNLOAD THE TRAINING DATASET HERE.

It contains:

After you have downloaded the file, extract it using tar -xzvf train.tar.gz (note: the extracted training dataset has a size of around 38 GB).
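Because the extracted file is too large to load into memory at once, it is best processed line by line. A minimal sketch, assuming the extracted file is named train.json (the actual file name is not stated above and may differ):

```python
import json

# NOTE: the filename "train.json" is an assumption -- adjust it to the
# actual file name produced by extracting train.tar.gz.
def iter_records(path="train.json"):
    """Yield one parsed JSON object (one channel-user record) per line,
    so the ~38 GB file never has to fit into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

This streams the file record by record, so even simple statistics (message counts, vocabulary, per-channel activity) can be computed in a single pass.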

For the evaluation, you need to write a file with the same syntax as train_truth.csv for the test set, which consists of 90 000 unseen channel-user combinations and their comments. 50 % of the users in the test set also appear in the training data with their contributions to other channels.

Testing Dataset

We now also provide the testing dataset for analysis purposes and future research.

> DOWNLOAD THE TEST DATASET HERE.

Data Format

Comment Data

Each line of the dataset file (training and testing) is a JSON object with the following keys: "c" identifies the channel, "u" identifies the user, and "ms" is the list of the user’s messages in that channel. Each message carries a timestamp "t" (Unix epoch in milliseconds), the game or category "g" that was being streamed, and the message text "m".

An example line (beautified):

{
    "c": "147b9b331f7a6b4e993869cc12ed90de",
    "u": "e14a9064f7c281b255dc0c93e8db1b81",
    "ms": [
        {
            "t": 1577836800003,
            "g": "Just Chatting",
            "m": "hachu20"
        },
        ...
    ]
}

In this case, the message “hachu20” is the text representation of an emote. Emotes are small static or animated images that are very popular on Twitch.
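A record in this format can be turned into flat per-message rows for further processing. A minimal sketch; the helper name flatten is hypothetical, and the key names follow the example record above:

```python
import json

# Hypothetical helper: flatten one dataset line into
# (channel, user, timestamp, game, message) tuples.
def flatten(line):
    obj = json.loads(line)
    return [(obj["c"], obj["u"], m["t"], m["g"], m["m"])
            for m in obj["ms"]]
```

Applied to the example line above, this yields a single tuple containing the channel hash, the user hash, the timestamp 1577836800003, the category "Just Chatting", and the message "hachu20".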

Ground Truth

The ground truth files are CSV files containing all user-channel combinations from the training and testing sets, respectively.

Example:

user,channel,subscribed
09f7e02f1290be211da707a266f153b3,52f83ff6877e42f613bcd2444c22528c,False
52f83ff6877e42f613bcd2444c22528c,09f7e02f1290be211da707a266f153b3,True
...
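The ground truth can be loaded into a simple lookup table with the standard library. A minimal sketch; because csv.DictReader selects columns by name, the same function also works for test_truth.csv despite its additional columns:

```python
import csv

def load_truth(path):
    """Map (user, channel) -> bool subscription status from a
    train_truth.csv / test_truth.csv style file."""
    truth = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            truth[(row["user"], row["channel"])] = row["subscribed"] == "True"
    return truth
```

Note that the subscribed column holds the strings "True"/"False", so it is compared as text rather than cast with bool().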

In the testing dataset’s test_truth.csv, you will find several additional columns that help with your analysis:

Evaluation

In order to ensure that the test set covers different levels of user activity, we performed the following procedure to sample the 90 000 user-channel combinations:

  1. We counted the number of messages in the dataset per user and per channel (e.g. user 1 made 20 comments, channel X received 273 comments, …).
  2. We categorized each user/channel as high/normal/low activity based on the number of comments: the lowest and highest 25 % of users/channels are of low and high activity, respectively.
  3. We removed all user-channel combinations for which the subscription status changed within the time span (i.e. users who subscribed to or unsubscribed from a channel during January).
  4. We sampled 10 000 channel-user combinations for each of the nine activity combinations (e.g. highly active user and normally active channel, …), so 90 000 channel-user combinations were sampled in total for the test dataset.
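The activity bucketing in step 2 can be sketched with quartile cut points. This is an illustration only: the exact percentile method and the handling of counts that fall on a boundary are assumptions, not documented details of the sampling script.

```python
from statistics import quantiles

def bucket_activity(counts):
    """counts: dict mapping entity (user or channel) -> number of comments.
    Returns entity -> 'low' / 'normal' / 'high', where the lowest and
    highest 25 % of entities are low and high activity, respectively."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q25, _, q75 = quantiles(counts.values(), n=4)
    levels = {}
    for entity, c in counts.items():
        if c <= q25:
            levels[entity] = "low"
        elif c >= q75:
            levels[entity] = "high"
        else:
            levels[entity] = "normal"
    return levels
```

Running the same bucketing independently for users and for channels gives the 3 × 3 = 9 activity combinations from which 10 000 pairs each were sampled.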

You will submit an estimate of the subscription status for each user-channel combination in the test set, using the same CSV structure as the ground truth. Your submission is then judged by the F1 score on the binary subscription status:

\[F_1 = \frac{2 \cdot \text{true positives}}{2 \cdot \text{true positives} + \text{false negatives} + \text{false positives}}\]

Here, true positives is the number of user-channel combinations correctly identified as subscribed, false positives the number falsely estimated as subscribed, and false negatives the number falsely estimated as not subscribed.
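The metric can be computed directly from two lookup tables. A minimal sketch, assuming truth and predictions are both dicts mapping (user, channel) to a boolean subscription status:

```python
def f1_score(truth, pred):
    """F1 = 2*TP / (2*TP + FN + FP) over binary subscription status.
    truth/pred: dicts (user, channel) -> bool subscribed."""
    tp = sum(1 for k, t in truth.items() if t and pred.get(k, False))
    fp = sum(1 for k, t in truth.items() if not t and pred.get(k, False))
    fn = sum(1 for k, t in truth.items() if t and not pred.get(k, False))
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom else 0.0
```

Missing predictions are treated as "not subscribed" here; the official evaluation script linked below is authoritative on such edge cases.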

The Python script that was used to evaluate submissions can be found here.

Submission

The submission took place via TIRA. For evaluation, the software was uploaded to the service, which runs the code in a virtual machine. The software has access to the dataset in a given input directory and must write a valid predictions.csv with its predictions to a given output directory.

More information on the submission process can be found here.

Baseline

We provided a random-sampling baseline, which samples predictions from the training class distribution: 8.02 % of the training channel-user combinations are subscribed and 91.98 % are not. Sampling from this distribution yields an expected F1 score of 7.41 %. Results of the participants can be found here.
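The baseline can be sketched in a few lines. The seed parameter is added here for reproducibility and is an assumption, not part of the original baseline:

```python
import random

def random_baseline(pairs, p_subscribed=0.0802, seed=0):
    """Predict "subscribed" for each (user, channel) pair with the
    training base rate of 8.02 %, independently of the pair's content."""
    rng = random.Random(seed)
    return {pair: rng.random() < p_subscribed for pair in pairs}
```

Any submission should comfortably beat this content-blind baseline, since it ignores the messages entirely.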

Regarding other datasets

No data other than the provided dataset may be used directly to solve the challenge’s task. You may, however, use other data, e.g. to pretrain your model.

Related Work
