{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Jupyter Notebook for Interactive Labeling\n", "______\n", "\n", "This Jupyter Notebook combines a manual and automated labeling technique.\n", "It includes scikit learn's Label Propagation Algorithm.\n", "By calculating estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.\n", "For multiclass labeling, 3 classes are used.\n", "\n", "In each iteration we...\n", "- check/correct the next 100 article labels manually.\n", " \n", "- apply the Label Propagation classification algorithm which returns a vector class_probs $(K_1, K_2, ... , K_6)$ per sample with the probabilities $K_i$ per class $i$. Estimated class labels are adopted automatically, if the estimated probability $K_x > 0.99$ with $x \\in {1,...,6}$.\n", " \n", "Please note: User instructions are written in upper-case.\n", "__________\n", "Version: 2019-02-04, Anne Lorenz" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import csv\n", "import operator\n", "import pickle\n", "import random\n", "\n", "from ipywidgets import interact, interactive, fixed, interact_manual\n", "import ipywidgets as widgets\n", "from IPython.core.interactiveshell import InteractiveShell\n", "from IPython.display import display\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from LabelPropagation import LabelPropagation\n", "from MNBInteractive import MNBInteractive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part I: Data preparation\n", "\n", "First, we import our data set of 10 000 business news articles from a csv file.\n", "It contains 833/834 articles of each month of the year 2017.\n", "For detailed information regarding the data set, please read the full documentation." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# round number to save intermediate label status of data set\n", "m = -1\n", "\n", "# initialize random => reproducible sequence\n", "random.seed(5)\n", "\n", "filepath = '../data/cleaned_data_set_without_header.csv'\n", "\n", "# set up wider display area\n", "pd.set_option('display.max_colwidth', -1)\n", "\n", "# set precision of output\n", "np.set_printoptions(precision=3)\n", "\n", "# show full text for print statement\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of samples in data set in total: 10000\n" ] } ], "source": [ "df = pd.read_csv(filepath,\n", " header=None,\n", " sep='|',\n", " engine='python',\n", " names = [\"Uuid\", \"Title\", \"Text\", \"Site\", \"SiteSection\", \"Url\", \"Timestamp\"],\n", " decimal='.',\n", " quotechar='\\'',\n", " quoting=csv.QUOTE_NONNUMERIC)\n", "\n", "# add column for indices\n", "df['Index'] = df.index.values.astype(int)\n", "\n", "# add round annotation (indicates labeling time)\n", "df['Round'] = np.nan\n", "\n", "# initialize label column with -1 for unlabeled samples\n", "df['Label'] = np.full((len(df)), -1).astype(int)\n", "\n", "# add column for estimated probability\n", "df['Probability'] = np.nan\n", "\n", "# store auto-estimated label, initialize with -1 for unestimated samples\n", "df['Estimated'] = np.full((len(df)), -1).astype(int)\n", "\n", "# row number\n", "n_rows = df.shape[0]\n", "print('Number of samples in data set in total: {}'.format(n_rows))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n", "In the following, we limit the number of occurences of a certain company name in all labeled articles to 3 to avoid imbalance." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def show_next(index):\n", " ''' this method displays an article's text and an interactive slider to set its label manually\n", " '''\n", " print('News article no. {}:'.format(index))\n", " print()\n", " print('HEADLINE:')\n", " print(df.loc[df['Index'] == index, 'Title'])\n", " print()\n", " print('TEXT:')\n", " print(df.loc[df['Index'] == index, 'Text'])\n", " \n", " def f(x):\n", " # save user input\n", " df.loc[df['Index'] == index, 'Label'] = x\n", " df.loc[df['Index'] == index, 'Round'] = m\n", "\n", " # create slider widget for labels\n", " interact(f, x = widgets.IntSlider(min=-1, max=2, step=1, value=df.loc[df['Index'] == index, 'Estimated']))\n", " print('0: Other/Unrelated news, 1: Merger,') \n", " print('2: Topics related to deals, investments and mergers')\n", " print('(e.g. 
merger pending/in talks/to be approved or merger rejected/aborted/denied or sale of unit or')\n", " print('Share Deal/Asset Deal/acquisition or merger as incidental remark/not main topic/not current or speculative)')\n", " print('___________________________________________________________________________________________________________')\n", " print()\n", " print()\n", "\n", "# list of article indices that will be shown next\n", "label_next = []" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# global dict of all articles (article index => list of mentioned organizations)\n", "dict_art_orgs = {}\n", "with open('../obj/dict_articles_organizations_without_banks.pkl', 'rb') as input:\n", " dict_art_orgs = pickle.load(input)\n", "\n", "# global dict of mentioned companies in labeled articles (company name => number of occurences\n", "dict_limit = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The iteration part starts here:\n", "\n", "## Part II: Manual checking of estimated labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PLEASE INSERT M MANUALLY IF PROCESS HAS BEEN INTERRUPTED BEFORE." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "m = 9" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last round number: 9\n", "Number of manually labeled articles: 1000\n", "Number of manually unlabeled articles: 9000\n" ] } ], "source": [ "# read current data set from csv\n", "df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n", " sep='|',\n", " usecols=range(1,13), # drop first column 'unnamed'\n", " encoding='utf-8',\n", " quoting=csv.QUOTE_NONNUMERIC,\n", " quotechar='\\'')\n", "\n", "# find current iteration/round number\n", "m = int(df['Round'].max())\n", "print('Last round number: {}'.format(m))\n", "print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n", "print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# initialize dict_limit\n", "df_labeled = df[df['Label'] != -1]\n", "\n", "for index in df_labeled['Index']:\n", " orgs = dict_art_orgs[index]\n", " for org in orgs:\n", " if org in dict_limit:\n", " dict_limit[org] += 1\n", " else:\n", " dict_limit[org] = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# OPTIONAL:\n", "# print organizations that are mentioned 3 times and therefore limited\n", "for k, v in dict_limit.items():\n", " if v == 3:\n", " print(k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now check (and correct if necessary) the next 100 auto-labeled articles." 
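] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cells draw a new batch of 100 candidate articles while enforcing the company limit of 3. The toy example below (hypothetical company names and counts, not part of the pipeline) sketches the acceptance check used in `pick_random_articles` further down:\n", "\n", "```python\n", "# Illustrative sketch of the occurrence limit: an article is only accepted if\n", "# none of its companies has already reached the limit in the labeled set.\n", "limit = 3\n", "counts = {'ACME Corp': 3, 'Foo Inc': 1}        # hypothetical current counts\n", "article_companies = ['Foo Inc', 'Bar Ltd']     # hypothetical article\n", "\n", "accept = all(counts.get(company, 0) < limit for company in article_companies)\n", "print(accept)  # True: neither company has reached the limit yet\n", "\n", "if accept:\n", "    # update the counts, as done for dict_limit in pick_random_articles below\n", "    for company in article_companies:\n", "        counts[company] = counts.get(company, 0) + 1\n", "print(counts)  # {'ACME Corp': 3, 'Foo Inc': 2, 'Bar Ltd': 1}\n", "```"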
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if m == -1:\n", " indices = list(range(10000))\n", "else:\n", " # indices of recently auto-labeled articles\n", " indices = df.loc[(df['Estimated'] != -1) & (df['Label'] == -1), 'Index'].tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# increment round number\n", "m += 1\n", "print('This round number: {}'.format(m))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pick_random_articles(n, limit = 3):\n", " ''' pick n random articles, check if company occurences under limit.\n", " returns list of n indices of the articles we can label next.\n", " '''\n", " # labeling list\n", " list_arts = []\n", " # article counter\n", " i = 0\n", " while i < n:\n", " # pick random article\n", " rand_i = random.choice(indices)\n", " # list of companies in that article\n", " companies = dict_art_orgs[rand_i]\n", " if all((dict_limit.get(company) == None) or (dict_limit[company] < limit ) for company in companies): \n", " for company in companies:\n", " if company in dict_limit:\n", " dict_limit[company] += 1\n", " else:\n", " dict_limit[company] = 1\n", " # add article to labeling list\n", " list_arts.append(rand_i)\n", " indices.remove(rand_i)\n", " i += 1\n", " return list_arts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate new list of article indices for labeling\n", "batchsize = 100\n", "label_next = pick_random_articles(batchsize)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "for index in label_next:\n", " show_next(index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Number of manual labels in round no. 
{}:'.format(m))\n", "print('0:{}, 1:{}, 2:{}'.format(len(df.loc[(df['Label'] == 0) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 1) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 2) & (df['Round'] == m)])))\n", "\n", "print('Number of articles to be corrected in this round: {}'.format(len(df.loc[(df['Label'] != -1) & (df['Estimated'] != -1) & (df['Round'] == m) & (df['Label'] != df['Estimated'])])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save intermediate status\n", "df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n", " sep='|',\n", " mode='w',\n", " encoding='utf-8',\n", " quoting=csv.QUOTE_NONNUMERIC,\n", " quotechar='\\'')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#df.loc[df['Label'] != -1][:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part III: Model building and automated labeling" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# THIS CELL IS OPTIONAL\n", "\n", "# read current data set from csv\n", "m = \n", "df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n", " sep='|',\n", " usecols=range(1,13), # drop first column 'unnamed'\n", " encoding='utf-8',\n", " quoting=csv.QUOTE_NONNUMERIC,\n", " quotechar='\\'')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We build a classification model and check if it is possible to label articles automatically." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# MNB: starting label propagation\n", "# BOW: extracting all words from articles...\n", "\n", "# BOW: making vocabulary of data set...\n", "\n", "# BOW: vocabulary consists of 14414 features.\n", "\n", "# MNB: fit training data and calculate matrix...\n", "\n", "# BOW: calculating matrix...\n", "\n", "# BOW: calculating frequencies...\n", "\n", "# MNB: transform testing data to matrix...\n", "\n", "# BOW: extracting all words from articles...\n", "\n", "# BOW: calculating matrix...\n", "\n", "# BOW: calculating frequencies...\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Anne\\Anaconda3\\lib\\site-packages\\sklearn\\semi_supervised\\label_propagation.py:205: RuntimeWarning: invalid value encountered in true_divide\n", " probabilities /= normalizer\n", "C:\\Users\\Anne\\Anaconda3\\lib\\site-packages\\sklearn\\semi_supervised\\label_propagation.py:205: RuntimeWarning: invalid value encountered in true_divide\n", " probabilities /= normalizer\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "# MNB: ending label propagation\n", "Wall time: 41min 56s\n" ] } ], "source": [ "# use sklearn's CountVectorizer\n", "cv = False\n", "\n", "# call script with manually labeled and manually unlabeled samples\n", "%time class_probs, predictions = LabelPropagation.propagate_labels(df.loc[df['Label'] != -1], df.loc[df['Label'] == -1], cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We label each article with class $j$, if its estimated probability for class $j$ is higher than our threshold:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " 
[nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [ 1. 0. 0.]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]\n", " [nan nan nan]]\n", "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 
0.]\n" ] } ], "source": [ "print(class_probs[:100])\n", "print(predictions[:100])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# only labels with this minimum probability are adopted\n", "threshold = 0.99\n", "# dict for counting estimated labels\n", "estimated_labels = {0:0, 1:0, 2:0}\n", "\n", "# series of indices of recently estimated articles \n", "indices_estimated = df.loc[df['Label'] == -1, 'Index'].tolist()\n", "\n", "# for every row i and every element j in row i\n", "for (i,j), value in np.ndenumerate(class_probs):\n", " # check if probability of class i is not less than threshold\n", " if class_probs[i][j] > threshold:\n", " index = indices_estimated[i]\n", " # save estimated label\n", " df.loc[index, 'Estimated'] = classes[j]\n", " # annotate probability\n", " df.loc[index, 'Probability'] = value\n", " # count labels\n", " estimated_labels[int(classes[j])] += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Number of auto-labeled samples in round {}: {}'.format(m, sum(estimated_labels.values())))\n", "print('Estimated labels: {}'.format(estimated_labels))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# THIS CELL IS OPTIONAL\n", "# let the Naive Bayes Algorithm test the quality of data set's labels\n", "\n", "# split data into text and label set\n", "X = df.loc[df['Label'] != -1, 'Title'] + '. ' + df.loc[df['Label'] != -1, 'Text']\n", "X = X.reset_index(drop=True)\n", "y = df.loc[df['Label'] != -1, 'Label']\n", "y = y.reset_index(drop=True)\n", "\n", "# use sklearn's CountVectorizer\n", "cv = False\n", "\n", "# call script with manually labeled and manually unlabeled samples\n", "#%time MNBInteractive.measure_mnb(X, y, cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('End of this round (no. {}):'.format(m))\n", "print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n", "print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save this round to csv\n", "df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n", " sep='|',\n", " mode='w',\n", " encoding='utf-8',\n", " quoting=csv.QUOTE_NONNUMERIC,\n", " quotechar='\\'')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NOW PLEASE CONTINUE WITH PART II.\n", "REPEAT UNTIL ALL SAMPLES ARE LABELED." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }