# Jupyter Notebook for Interactive Labeling
______

This Jupyter Notebook combines manual and automated labeling techniques.
It uses scikit-learn's Label Propagation algorithm.
Based on the estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.
The labeling task is multiclass with 3 classes.

In each iteration we...
- check/correct the next 100 article labels manually.
 
- apply the Label Propagation classification algorithm, which returns a vector class_probs $(K_1, K_2, K_3)$ per sample with the probability $K_i$ for each class $i$. Estimated class labels are adopted automatically if the estimated probability $K_x > 0.99$ with $x \in \{1, 2, 3\}$.
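
The decision rule of the second step can be sketched as follows. This is only an illustration with made-up probability vectors; the helper name `decide` is hypothetical and not part of the notebook's scripts.

```python
THRESHOLD = 0.99  # minimum probability for adopting a label automatically

def decide(class_probs, threshold=THRESHOLD):
    """Return the index of the class to adopt automatically,
    or None if the article has to be labeled manually."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best if class_probs[best] > threshold else None

# made-up probability vectors (K_1, K_2, K_3) for two articles
confident = decide([0.001, 0.995, 0.004])   # second class exceeds the threshold
uncertain = decide([0.60, 0.30, 0.10])      # no class exceeds the threshold
print(confident, uncertain)
```

An article for which `decide` returns `None` stays in the pool for manual labeling.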
 
Please note: User instructions are written in upper-case.
__________
Version: 2019-02-04, Anne Lorenz

In [1]:
import csv
import operator
import pickle
import random

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display
import numpy as np
import pandas as pd

from LabelPropagation import LabelPropagation
from MNBInteractive import MNBInteractive

## Part I: Data preparation

First, we import our data set of 10 000 business news articles from a CSV file.
It contains 833 or 834 articles for each month of 2017.
For detailed information regarding the data set, please read the full documentation.
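
The figures 833 and 834 are consistent with spreading 10 000 articles as evenly as possible over 12 months, as this small calculation shows. Which months hold the additional article is not stated here.

```python
n_articles = 10000
n_months = 12

# every month gets the quotient; the remainder is spread over single months
per_month, extra = divmod(n_articles, n_months)

# sanity check: (12 - extra) months with per_month articles,
# extra months with one article more
total = (n_months - extra) * per_month + extra * (per_month + 1)
print(per_month, extra, total)
```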

In [2]:
# round number to save intermediate label status of data set
m = -1

# initialize random => reproducible sequence
random.seed(5)

filepath = '../data/cleaned_data_set_without_header.csv'

# set up wider display area
# (None means unlimited width; older pandas versions used -1 for this, which newer ones reject)
pd.set_option('display.max_colwidth', None)

# set precision of output
np.set_printoptions(precision=3)

# display the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

In [3]:
df = pd.read_csv(filepath,
 header=None,
 sep='|',
 engine='python',
 names = ["Uuid", "Title", "Text", "Site", "SiteSection", "Url", "Timestamp"],
 decimal='.',
 quotechar='\'',
 quoting=csv.QUOTE_NONNUMERIC)

# add column for indices
df['Index'] = df.index.values.astype(int)

# add round annotation (indicates labeling time)
df['Round'] = np.nan

# initialize label column with -1 for unlabeled samples
df['Label'] = np.full((len(df)), -1).astype(int)

# add column for estimated probability
df['Probability'] = np.nan

# store auto-estimated label, initialize with -1 for unestimated samples
df['Estimated'] = np.full((len(df)), -1).astype(int)

# row number
n_rows = df.shape[0]
print('Number of samples in data set in total: {}'.format(n_rows))

Number of samples in data set in total: 10000


We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).
In the following, we limit the number of occurrences of any company name across all labeled articles to 3 to avoid imbalance.
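
The limit check can be illustrated on made-up data. The helper names `under_limit` and `register` as well as the company names are hypothetical; the notebook's own implementation follows further below in `pick_random_articles`.

```python
def under_limit(companies, counts, limit=3):
    """True if no company of the article has reached the limit yet."""
    return all(counts.get(company, 0) < limit for company in companies)

def register(companies, counts):
    """Count one more labeled article for each mentioned company."""
    for company in companies:
        counts[company] = counts.get(company, 0) + 1

# made-up example: 'Acme' already appears in 3 labeled articles
counts = {'Acme': 3, 'Globex': 1}
candidates = {0: ['Acme', 'Globex'], 1: ['Globex'], 2: []}

accepted = []
for index, companies in candidates.items():
    if under_limit(companies, counts):
        register(companies, counts)
        accepted.append(index)
print(accepted, counts)
```

Article 0 is skipped because 'Acme' has already reached the limit, while an article without any recognized company (article 2) always passes the check.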

In [4]:
def show_next(index):
    ''' this method displays an article's text and an interactive slider to set its label manually
    '''
    print('News article no. {}:'.format(index))
    print()
    print('HEADLINE:')
    print(df.loc[df['Index'] == index, 'Title'])
    print()
    print('TEXT:')
    print(df.loc[df['Index'] == index, 'Text'])

    def f(x):
        # save user input
        df.loc[df['Index'] == index, 'Label'] = x
        df.loc[df['Index'] == index, 'Round'] = m

    # preset the slider to the estimated label (as a plain int, not a Series)
    estimated = int(df.loc[df['Index'] == index, 'Estimated'].iloc[0])
    # create slider widget for labels
    interact(f, x = widgets.IntSlider(min=-1, max=2, step=1, value=estimated))
    print('0: Other/Unrelated news, 1: Merger,')
    print('2: Topics related to deals, investments and mergers')
    print('(e.g. merger pending/in talks/to be approved or merger rejected/aborted/denied or sale of unit or')
    print('Share Deal/Asset Deal/acquisition or merger as incidental remark/not main topic/not current or speculative)')
    print('___________________________________________________________________________________________________________')
    print()
    print()

# list of article indices that will be shown next
label_next = []

In [5]:
# global dict of all articles (article index => list of mentioned organizations)
dict_art_orgs = {}
with open('../obj/dict_articles_organizations_without_banks.pkl', 'rb') as f:
    dict_art_orgs = pickle.load(f)

# global dict of mentioned companies in labeled articles (company name => number of occurrences)
dict_limit = {}

The iteration part starts here:

## Part II: Manual checking of estimated labels

PLEASE INSERT M MANUALLY IF PROCESS HAS BEEN INTERRUPTED BEFORE.

In [6]:
m = 9

In [7]:
# read current data set from csv
df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m),
 sep='|',
 usecols=range(1,13), # drop first column 'unnamed'
 encoding='utf-8',
 quoting=csv.QUOTE_NONNUMERIC,
 quotechar='\'')

# find current iteration/round number
m = int(df['Round'].max())
print('Last round number: {}'.format(m))
print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))
print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))

Last round number: 9
Number of manually labeled articles: 1000
Number of manually unlabeled articles: 9000


In [8]:
# initialize dict_limit
df_labeled = df[df['Label'] != -1]

for index in df_labeled['Index']:
    orgs = dict_art_orgs[index]
    for org in orgs:
        if org in dict_limit:
            dict_limit[org] += 1
        else:
            dict_limit[org] = 1

In [None]:
# OPTIONAL:
# print organizations that are mentioned 3 times and therefore limited
for k, v in dict_limit.items():
    if v == 3:
        print(k)

We now check (and correct if necessary) the next 100 auto-labeled articles.

In [None]:
if m == -1:
    indices = list(range(10000))
else:
    # indices of recently auto-labeled articles
    indices = df.loc[(df['Estimated'] != -1) & (df['Label'] == -1), 'Index'].tolist()

In [None]:
# increment round number
m += 1
print('This round number: {}'.format(m))

In [None]:
def pick_random_articles(n, limit = 3):
    ''' pick n random articles, check if company occurrences are under limit.
    returns list of n indices of the articles we can label next.
    note: loops until n suitable articles are found in the global list 'indices'.
    '''
    # labeling list
    list_arts = []
    # article counter
    i = 0
    while i < n:
        # pick random article
        rand_i = random.choice(indices)
        # list of companies in that article
        companies = dict_art_orgs[rand_i]
        # accept the article only if none of its companies has reached the limit
        if all(dict_limit.get(company, 0) < limit for company in companies):
            for company in companies:
                dict_limit[company] = dict_limit.get(company, 0) + 1
            # add article to labeling list
            list_arts.append(rand_i)
            indices.remove(rand_i)
            i += 1
    return list_arts

In [None]:
# generate new list of article indices for labeling
batchsize = 100
label_next = pick_random_articles(batchsize)

PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:

In [None]:
for index in label_next:
 show_next(index)

In [None]:
print('Number of manual labels in round no. {}:'.format(m))
print('0:{}, 1:{}, 2:{}'.format(len(df.loc[(df['Label'] == 0) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 1) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 2) & (df['Round'] == m)])))

print('Number of articles to be corrected in this round: {}'.format(len(df.loc[(df['Label'] != -1) & (df['Estimated'] != -1) & (df['Round'] == m) & (df['Label'] != df['Estimated'])])))

In [None]:
# save intermediate status
df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),
 sep='|',
 mode='w',
 encoding='utf-8',
 quoting=csv.QUOTE_NONNUMERIC,
 quotechar='\'')
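
The file format used for saving (separator `|`, quote character `'`, `csv.QUOTE_NONNUMERIC`) can be tried out in isolation with Python's `csv` module, whose quoting constants are the ones passed to pandas above. The sample row is made up; this sketch only illustrates the quoting rules, not the pandas calls themselves.

```python
import csv
import io

rows = [['Title', 'Label', 'Probability'],
        ["Moody's upgrades Acme", 1, 0.995]]

# write with the same separator, quote character and quoting mode as above
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', quotechar="'",
                    quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(rows)
raw = buffer.getvalue()
print(raw)

# read back: quoted fields stay strings, unquoted fields are converted to float
buffer.seek(0)
reader = csv.reader(buffer, delimiter='|', quotechar="'",
                    quoting=csv.QUOTE_NONNUMERIC)
restored = list(reader)
print(restored)
```

Only non-numeric fields are quoted, and a quote character inside a text is doubled on writing, so article texts containing apostrophes or `|` survive the round trip.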

In [None]:
#df.loc[df['Label'] != -1][:100]

## Part III: Model building and automated labeling

In [None]:
# THIS CELL IS OPTIONAL

# read current data set from csv
# PLEASE INSERT THE ROUND NUMBER M BEFORE RUNNING THIS CELL
m = 
df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),
 sep='|',
 usecols=range(1,13), # drop first column 'unnamed'
 encoding='utf-8',
 quoting=csv.QUOTE_NONNUMERIC,
 quotechar='\'')

We build a classification model and check if it is possible to label articles automatically.

In [9]:
# flag: whether to use sklearn's CountVectorizer
cv = False

# call script with manually labeled and manually unlabeled samples
%time class_probs, predictions = LabelPropagation.propagate_labels(df.loc[df['Label'] != -1], df.loc[df['Label'] == -1], cv)

# MNB: starting label propagation
# BOW: extracting all words from articles...
# BOW: making vocabulary of data set...
# BOW: vocabulary consists of 14414 features.
# MNB: fit training data and calculate matrix...
# BOW: calculating matrix...
# BOW: calculating frequencies...
# MNB: transform testing data to matrix...
# BOW: extracting all words from articles...
# BOW: calculating matrix...
# BOW: calculating frequencies...
 probabilities /= normalizer
 probabilities /= normalizer
# MNB: ending label propagation
Wall time: 41min 56s
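
To make the idea behind label propagation more tangible, here is a strongly simplified NumPy sketch on made-up one-dimensional data. It is not the code of the project's `LabelPropagation` script and not scikit-learn's implementation; the function name `propagate`, the RBF kernel and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def propagate(X, y, n_classes, gamma=20.0, n_iter=100):
    """Toy label propagation; y holds class ids, -1 marks unlabeled samples."""
    # RBF affinities between all pairs of samples
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dists)
    # row-normalize to obtain transition probabilities
    T = W / W.sum(axis=1, keepdims=True)
    # one-hot distributions for labeled samples, zeros for unlabeled ones
    rows = np.flatnonzero(y != -1)
    Y = np.zeros((len(y), n_classes))
    Y[rows, y[rows]] = 1.0
    clamp = Y[rows].copy()
    for _ in range(n_iter):
        Y = T @ Y                       # spread label mass to neighbors
        sums = Y.sum(axis=1, keepdims=True)
        sums[sums == 0] = 1.0           # guard: dividing by zero would yield nan
        Y /= sums
        Y[rows] = clamp                 # keep the manual labels fixed
    return Y

# three well separated groups, one manually labeled sample per group
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [10.0], [10.1], [10.2]])
y = np.array([0, -1, -1, 1, -1, -1, 2, -1, -1])
probs = propagate(X, y, n_classes=3)
labels = probs.argmax(axis=1)
print(labels)
```

Each row of `probs` plays the role of one class_probs vector: labeled samples keep their one-hot distribution, and unlabeled samples inherit the label mass of their nearest labeled neighbors.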


We label each article with class $j$ if its estimated probability for class $j$ is higher than our threshold:

In [10]:
print(class_probs[:100])
print(predictions[:100])

[[nan nan nan]
 [nan nan nan]
 [nan nan nan]
 ...
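
The output above contains only nan values. Since every comparison with nan evaluates to False, none of these entries can pass the threshold check, so no label would be adopted. A small guard like the following makes this visible before thresholding; the array here is a made-up stand-in for class_probs.

```python
import numpy as np

threshold = 0.99
# stand-in for the class_probs shown above: every entry is nan
class_probs = np.full((5, 3), np.nan)

# comparisons with nan are always False, so no entry passes the threshold
n_adopted = int((class_probs > threshold).sum())

# explicit check that exposes the problem before any labels are adopted
all_nan = bool(np.isnan(class_probs).all())
print(n_adopted, all_nan)
```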

In [None]:
# only labels with this minimum probability are adopted
threshold = 0.99
# dict for counting estimated labels
estimated_labels = {0:0, 1:0, 2:0}

# class label for each column of class_probs
# (assumption: columns are ordered by label 0, 1, 2; 'classes' was not defined before)
classes = [0, 1, 2]

# list of indices of recently estimated articles
indices_estimated = df.loc[df['Label'] == -1, 'Index'].tolist()

# for every row i and every element j in row i
for (i,j), value in np.ndenumerate(class_probs):
    # check if probability of class j is higher than threshold
    if value > threshold:
        index = indices_estimated[i]
        # save estimated label
        df.loc[index, 'Estimated'] = classes[j]
        # annotate probability
        df.loc[index, 'Probability'] = value
        # count labels
        estimated_labels[int(classes[j])] += 1

In [None]:
print('Number of auto-labeled samples in round {}: {}'.format(m, sum(estimated_labels.values())))
print('Estimated labels: {}'.format(estimated_labels))

In [None]:
# THIS CELL IS OPTIONAL
# let the Naive Bayes Algorithm test the quality of data set's labels

# split data into text and label set
X = df.loc[df['Label'] != -1, 'Title'] + '. ' + df.loc[df['Label'] != -1, 'Text']
X = X.reset_index(drop=True)
y = df.loc[df['Label'] != -1, 'Label']
y = y.reset_index(drop=True)

# flag: whether to use sklearn's CountVectorizer
cv = False

# call script with the manually labeled samples (texts X and labels y)
#%time MNBInteractive.measure_mnb(X, y, cv)

In [None]:
print('End of this round (no. {}):'.format(m))
print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))
print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))

In [None]:
# save this round to csv
df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),
 sep='|',
 mode='w',
 encoding='utf-8',
 quoting=csv.QUOTE_NONNUMERIC,
 quotechar='\'')

NOW PLEASE CONTINUE WITH PART II.
REPEAT UNTIL ALL SAMPLES ARE LABELED.