Task
Given the messages a user has posted on a Twitch channel, the task is to predict whether or not the user is subscribed to the channel.
Data
We provide a dataset of publicly available Twitch comments, including metadata, that we obtained via Twitch’s official API. The data includes information about whether the commenting user is subscribed to the channel he or she is commenting on. Given unseen channel-user combinations, your task is to predict whether the user is subscribed to the channel or not.
For more detailed information on the datasets and a brief analysis, please take a look at the overview paper.
Training Dataset
The complete dataset consists of more than 400 000 000 public Twitch comments from English channels across one month (January 2020) and includes some metadata.
> DOWNLOAD THE TRAINING DATASET HERE.
It contains:
- the training channel-user combinations together with all comments made by the user in the channel (train.json).
- all training channel-user combinations and the corresponding subscription status (train_truth.csv).
After you have downloaded the file, extract it using tar -xzvf train.tar.gz (Attention: the training dataset has a size of around 38 GB).
In the evaluation, you need to write a file with the syntax of train_truth.csv for the test set, which consists of 90 000 unseen channel-user combinations and their comments.
50 % of the users in the test set are present in the training data with their contributions to other channels.
Testing Dataset
We now also provide the testing dataset for analysis purposes and future research.
> DOWNLOAD THE TEST DATASET HERE.
Data Format
Comment Data
Each line of the dataset (training and testing) file is a JSON object with the following keys:
- c: The name of the channel (anonymized due to privacy reasons)
- u: The username of the commenting user (anonymized due to privacy reasons)
- ms: A list of JSON objects containing the following keys:
  - t: The timestamp when the user commented
  - g: The game that was played in the channel when the user commented
  - m: The chat comment/message as text
An example line (beautified):
{
"c": "147b9b331f7a6b4e993869cc12ed90de",
"u": "e14a9064f7c281b255dc0c93e8db1b81",
"ms": [
{
"t": 1577836800003,
"g": "Just Chatting",
"m": "hachu20"
},
...
]
}
In this case, the message “hachu20” is the text description of an emote. Emotes are small still or moving images that are very popular on Twitch.
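As a minimal sketch of how this format can be read, the following Python snippet parses one JSON object per line and inspects the first message. The helper name iter_records is hypothetical, and interpreting t as a Unix timestamp in milliseconds is an assumption based on the example value, not something the task description states.

```python
import json
from datetime import datetime, timezone


def iter_records(lines):
    """Yield one parsed channel-user record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# The example line from above, written on a single line as in the dataset file.
sample_line = (
    '{"c": "147b9b331f7a6b4e993869cc12ed90de", '
    '"u": "e14a9064f7c281b255dc0c93e8db1b81", '
    '"ms": [{"t": 1577836800003, "g": "Just Chatting", "m": "hachu20"}]}'
)

records = list(iter_records([sample_line]))
first = records[0]
first_message = first["ms"][0]

# Assumption: "t" is a Unix timestamp in milliseconds.
posted_at = datetime.fromtimestamp(first_message["t"] / 1000, tz=timezone.utc)
print(first["c"], first["u"], first_message["g"], first_message["m"], posted_at)
```

Because the training file is large, it is probably better to pass an open file handle (for example, open("train.json", encoding="utf-8")) to iter_records and process records one at a time rather than loading everything into memory.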
Ground Truth
The ground truth files are CSV files containing all user-channel combinations from the training/testing set. They have the following columns:
- user: The username of the commenting user (anonymized due to privacy reasons; same as in dataset)
- channel: The name of the channel (anonymized due to privacy reasons; same as in dataset)
- subscribed: Whether or not the user is subscribed to the channel. Channel-user combinations that changed subscription status during the data period were removed from the complete dataset.
Example:
user,channel,subscribed
09f7e02f1290be211da707a266f153b3,52f83ff6877e42f613bcd2444c22528c,False
52f83ff6877e42f613bcd2444c22528c,09f7e02f1290be211da707a266f153b3,True
...
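A ground truth file in this layout can be loaded with Python's standard csv module, as in the sketch below. The function name load_truth and the choice of a dictionary keyed by (channel, user) are illustrative assumptions; the snippet reads the two example rows from an in-memory string instead of the real file.

```python
import csv
import io


def load_truth(csv_file):
    """Return a dict mapping (channel, user) to a boolean subscription label."""
    labels = {}
    for row in csv.DictReader(csv_file):
        # The file stores the label as the text "True" or "False".
        labels[(row["channel"], row["user"])] = row["subscribed"] == "True"
    return labels


# The two example rows from above.
sample_csv = io.StringIO(
    "user,channel,subscribed\n"
    "09f7e02f1290be211da707a266f153b3,52f83ff6877e42f613bcd2444c22528c,False\n"
    "52f83ff6877e42f613bcd2444c22528c,09f7e02f1290be211da707a266f153b3,True\n"
)

labels = load_truth(sample_csv)
print(labels)
```

For the actual data, the same function can be given an open file, for example open("train_truth.csv", newline="", encoding="utf-8").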
In the testing dataset’s test_truth.csv, you find multiple additional columns that help you with your analysis:
- channel, user, subscribed: as in the training dataset
- channel_activity and user_activity: give a class (“low”, “normal”, or “high”) according to the number of messages they write/receive.
- messages: number of messages of that user in that channel
- in_train: whether the user is present in the training dataset or not
Evaluation
To ensure that different user activities are covered in the test set, we performed the following procedure to sample 90 000 user-channel combinations:
- We analyzed the number of messages in the dataset per user/channel (user 1 has made 20 comments, channel X has got 273 comments, …)
- We categorized each user/channel into high/normal/low activity based on the number of comments: the lowest and highest 25 % of the users/channels are of low and high activity, respectively
- All user-channel combinations for which the subscription changed in the time span were removed (i.e. users who subscribed to or unsubscribed from a channel in January)
- We sampled 10 000 channel-user combinations for each activity combination (i.e. highly active user and normally active channel, …), so 90 000 channel-user combinations were sampled in total for the test dataset
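The activity categorization in the steps above could look roughly like the following sketch. It is not the organizers' code: the function name activity_classes is hypothetical, and the exact quartile method and the handling of ties at the cut points are assumptions, since the description only states the 25 % thresholds.

```python
import statistics


def activity_classes(message_counts):
    """Map each key to "low", "normal", or "high" by message-count quartiles."""
    counts = sorted(message_counts.values())
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(counts, n=4)
    classes = {}
    for key, count in message_counts.items():
        # Assumption: counts exactly at a cut point fall into the outer class.
        if count <= q1:
            classes[key] = "low"
        elif count >= q3:
            classes[key] = "high"
        else:
            classes[key] = "normal"
    return classes


# Toy data: eight hypothetical users with 1 to 8 messages each.
user_counts = {f"user{i}": i for i in range(1, 9)}
user_classes = activity_classes(user_counts)
print(user_classes)
```

Applying the same function to per-channel message counts gives the channel classes, and pairing both classes per channel-user combination yields the nine activity combinations from which the test set was sampled.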
You will submit an estimate for the subscription status of each user-channel combination in the test set using the same CSV structure as the ground truth. Your submission is then judged by the F1 score regarding the binary subscription status:
\[F_1 = \frac{2 \cdot \text{true positives}}{2 \cdot \text{true positives} + \text{false negatives} + \text{false positives}}\]
The terms of the formula mean: true positives: Number of user-channel combinations that are correctly identified as subscribed; false positives: Number of user-channel combinations that were falsely estimated as subscribed; false negatives: Number of user-channel combinations that were falsely estimated as not subscribed.
The Python script that was used to evaluate submissions can be found here.
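For local validation, the F1 definition above can be computed in a few lines of Python. This is an illustrative sketch and not the official evaluation script; the function name f1_score, the use of two aligned lists of booleans, and returning 0.0 when the denominator is zero are assumptions.

```python
def f1_score(truth, predictions):
    """Compute F1 for the positive class ("subscribed") from aligned boolean lists."""
    true_positives = false_positives = false_negatives = 0
    for actual, predicted in zip(truth, predictions):
        if predicted and actual:
            true_positives += 1
        elif predicted and not actual:
            false_positives += 1
        elif actual and not predicted:
            false_negatives += 1
    denominator = 2 * true_positives + false_negatives + false_positives
    # Assumption: report 0.0 when there are no positives at all.
    return 2 * true_positives / denominator if denominator else 0.0


# Toy example: one true positive, one false negative, one false positive.
truth = [True, True, False, False]
predictions = [True, False, True, False]
score = f1_score(truth, predictions)
print(score)  # 0.5
```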
Submission
The submission took place via TIRA.
For evaluation, the software was uploaded to the service, which ran the code in a virtual machine.
The software had access to the dataset in a given directory and wrote a valid predictions.csv with predictions to a given output directory.
More information on the submission process can be found here.
Baseline
We provided a random sampling baseline, which samples randomly from the training class distribution; 8.02% of the training channel-user combinations are subscribed, 91.98% are not. Sampling from this distribution, we get an expected F1 score of 7.41%. Results of the participants can be found here.
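A baseline of this kind could be sketched as follows. This is not the organizers' implementation: the function name random_baseline, the fixed seed, and the placeholder channel and user names are assumptions made for illustration; only the 8.02% rate comes from the text.

```python
import random

# Share of subscribed channel-user combinations in the training data (from the text).
SUBSCRIBED_RATE = 0.0802


def random_baseline(pairs, rate=SUBSCRIBED_RATE, seed=0):
    """Predict "subscribed" for each pair independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {pair: rng.random() < rate for pair in pairs}


# Hypothetical (channel, user) pairs standing in for the test set.
pairs = [(f"channel{i}", f"user{i}") for i in range(100_000)]
baseline_predictions = random_baseline(pairs)
predicted_rate = sum(baseline_predictions.values()) / len(baseline_predictions)
print(f"share predicted as subscribed: {predicted_rate:.4f}")
```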
Regarding other datasets
No data other than the provided data can be directly used to solve the challenge’s task. You can, however, use other data, for example, to pretrain your model.