labeling analysis

2019-02-06 09:53:33 +01:00 · 146a292914
commit 146a292914
parent 97c2bde73d
19 changed files with 7499 additions and 16881 deletions
--- a/data/interactive_labeling_round_5.csv
+++ b/data/interactive_labeling_round_5.csv
--- a/data/interactive_labeling_round_6_temp.csv
+++ b/data/interactive_labeling_round_6_temp.csv
--- a/src/2019-01-29-al-interactive-labeling.ipynb
+++ b/src/2019-01-29-al-interactive-labeling.ipynb
--- a/src/2019-02-04-al-label-propagation.ipynb
+++ b/src/2019-02-04-al-label-propagation.ipynb
@ -0,0 +1,739 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jupyter Notebook for Interactive Labeling\n",
    "______\n",
    "\n",
    "This Jupyter Notebook combines a manual and automated labeling technique.\n",
    "It includes scikit learn's Label Propagation Algorithm.\n",
    "By calculating estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.\n",
    "For multiclass labeling, 3 classes are used.\n",
    "\n",
    "In each iteration we...\n",
    "- check/correct the next 100 article labels manually.\n",
    "  \n",
    "- apply the Label Propagation classification algorithm which returns a vector class_probs $(K_1, K_2, ... , K_6)$ per sample with the probabilities $K_i$ per class $i$. Estimated class labels are adopted automatically, if the estimated probability $K_x > 0.99$ with $x \\in {1,...,6}$.\n",
    "  \n",
    "Please note: User instructions are written in upper-case.\n",
    "__________\n",
    "Version: 2019-02-04, Anne Lorenz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import operator\n",
    "import pickle\n",
    "import random\n",
    "\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "import ipywidgets as widgets\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "from IPython.display import display\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from LabelPropagation import LabelPropagation\n",
    "from MNBInteractive import MNBInteractive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part I: Data preparation\n",
    "\n",
    "First, we import our data set of 10 000 business news articles from a csv file.\n",
    "It contains 833/834 articles of each month of the year 2017.\n",
    "For detailed information regarding the data set, please read the full documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# round number to save intermediate label status of data set\n",
    "m = -1\n",
    "\n",
    "# initialize random => reproducible sequence\n",
    "random.seed(5)\n",
    "\n",
    "filepath = '../data/cleaned_data_set_without_header.csv'\n",
    "\n",
    "# set up wider display area\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "\n",
    "# set precision of output\n",
    "np.set_printoptions(precision=3)\n",
    "\n",
    "# show full text for print statement\n",
    "InteractiveShell.ast_node_interactivity = \"all\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of samples in data set in total: 10000\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv(filepath,\n",
    "                 header=None,\n",
    "                 sep='|',\n",
    "                 engine='python',\n",
    "                 names = [\"Uuid\", \"Title\", \"Text\", \"Site\", \"SiteSection\", \"Url\", \"Timestamp\"],\n",
    "                 decimal='.',\n",
    "                 quotechar='\\'',\n",
    "                 quoting=csv.QUOTE_NONNUMERIC)\n",
    "\n",
    "# add column for indices\n",
    "df['Index'] = df.index.values.astype(int)\n",
    "\n",
    "# add round annotation (indicates labeling time)\n",
    "df['Round'] = np.nan\n",
    "\n",
    "# initialize label column with -1 for unlabeled samples\n",
    "df['Label'] = np.full((len(df)), -1).astype(int)\n",
    "\n",
    "# add column for estimated probability\n",
    "df['Probability'] = np.nan\n",
    "\n",
    "# store auto-estimated label, initialize with -1 for unestimated samples\n",
    "df['Estimated'] = np.full((len(df)), -1).astype(int)\n",
    "\n",
    "# row number\n",
    "n_rows = df.shape[0]\n",
    "print('Number of samples in data set in total: {}'.format(n_rows))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
    "In the following, we limit the number of occurences of a certain company name in all labeled articles to 3 to avoid imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_next(index):\n",
    "    ''' this method displays an article's text and an interactive slider to set its label manually\n",
    "    '''\n",
    "    print('News article no. {}:'.format(index))\n",
    "    print()\n",
    "    print('HEADLINE:')\n",
    "    print(df.loc[df['Index'] == index, 'Title'])\n",
    "    print()\n",
    "    print('TEXT:')\n",
    "    print(df.loc[df['Index'] == index, 'Text'])\n",
    "    \n",
    "    def f(x):\n",
    "        # save user input\n",
    "        df.loc[df['Index'] == index, 'Label'] = x\n",
    "        df.loc[df['Index'] == index, 'Round'] = m\n",
    "\n",
    "    # create slider widget for labels\n",
    "    interact(f, x = widgets.IntSlider(min=-1, max=2, step=1, value=df.loc[df['Index'] == index, 'Estimated']))\n",
    "    print('0: Other/Unrelated news, 1: Merger,') \n",
    "    print('2: Topics related to deals, investments and mergers')\n",
    "    print('(e.g. merger pending/in talks/to be approved or merger rejected/aborted/denied or sale of unit or')\n",
    "    print('Share Deal/Asset Deal/acquisition or merger as incidental remark/not main topic/not current or speculative)')\n",
    "    print('___________________________________________________________________________________________________________')\n",
    "    print()\n",
    "    print()\n",
    "\n",
    "# list of article indices that will be shown next\n",
    "label_next = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# global dict of all articles (article index => list of mentioned organizations)\n",
    "dict_art_orgs = {}\n",
    "with open('../obj/dict_articles_organizations_without_banks.pkl', 'rb') as input:\n",
    "        dict_art_orgs = pickle.load(input)\n",
    "\n",
    "# global dict of mentioned companies in labeled articles (company name => number of occurences\n",
    "dict_limit = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The iteration part starts here:\n",
    "\n",
    "## Part II: Manual checking of estimated labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PLEASE INSERT M MANUALLY IF PROCESS HAS BEEN INTERRUPTED BEFORE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = 9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last round number: 9\n",
      "Number of manually labeled articles: 1000\n",
      "Number of manually unlabeled articles: 9000\n"
     ]
    }
   ],
   "source": [
    "# read current data set from csv\n",
    "df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
    "          sep='|',\n",
    "          usecols=range(1,13), # drop first column 'unnamed'\n",
    "          encoding='utf-8',\n",
    "          quoting=csv.QUOTE_NONNUMERIC,\n",
    "          quotechar='\\'')\n",
    "\n",
    "# find current iteration/round number\n",
    "m = int(df['Round'].max())\n",
    "print('Last round number: {}'.format(m))\n",
    "print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
    "print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize dict_limit\n",
    "df_labeled = df[df['Label'] != -1]\n",
    "\n",
    "for index in df_labeled['Index']:\n",
    "    orgs = dict_art_orgs[index]\n",
    "    for org in orgs:\n",
    "        if org in dict_limit:\n",
    "            dict_limit[org] += 1\n",
    "        else:\n",
    "            dict_limit[org] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# OPTIONAL:\n",
    "# print organizations that are mentioned 3 times and therefore limited\n",
    "for k, v in dict_limit.items():\n",
    "    if v == 3:\n",
    "        print(k)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now check (and correct if necessary) the next 100 auto-labeled articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if m == -1:\n",
    "    indices = list(range(10000))\n",
    "else:\n",
    "    # indices of recently auto-labeled articles\n",
    "    indices = df.loc[(df['Estimated'] != -1) & (df['Label'] == -1), 'Index'].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# increment round number\n",
    "m += 1\n",
    "print('This round number: {}'.format(m))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pick_random_articles(n, limit = 3):\n",
    "    ''' pick n random articles, check if company occurences under limit.\n",
    "    returns list of n indices of the articles we can label next.\n",
    "    '''\n",
    "    # labeling list\n",
    "    list_arts = []\n",
    "    # article counter\n",
    "    i = 0\n",
    "    while i < n:\n",
    "        # pick random article\n",
    "        rand_i = random.choice(indices)\n",
    "        # list of companies in that article\n",
    "        companies = dict_art_orgs[rand_i]\n",
    "        if all((dict_limit.get(company) == None) or (dict_limit[company] < limit ) for company in companies): \n",
    "            for company in companies:\n",
    "                if company in dict_limit:\n",
    "                    dict_limit[company] += 1\n",
    "                else:\n",
    "                    dict_limit[company] = 1\n",
    "            # add article to labeling list\n",
    "            list_arts.append(rand_i)\n",
    "            indices.remove(rand_i)\n",
    "            i += 1\n",
    "    return list_arts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate new list of article indices for labeling\n",
    "batchsize = 100\n",
    "label_next = pick_random_articles(batchsize)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for index in label_next:\n",
    "    show_next(index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Number of manual labels in round no. {}:'.format(m))\n",
    "print('0:{}, 1:{}, 2:{}'.format(len(df.loc[(df['Label'] == 0) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 1) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 2) & (df['Round'] == m)])))\n",
    "\n",
    "print('Number of articles to be corrected in this round: {}'.format(len(df.loc[(df['Label'] != -1) & (df['Estimated'] != -1) & (df['Round'] == m) & (df['Label'] != df['Estimated'])])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save intermediate status\n",
    "df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
    "      sep='|',\n",
    "      mode='w',\n",
    "      encoding='utf-8',\n",
    "      quoting=csv.QUOTE_NONNUMERIC,\n",
    "      quotechar='\\'')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#df.loc[df['Label'] != -1][:100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part III: Model building and automated labeling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# THIS CELL IS OPTIONAL\n",
    "\n",
    "# read current data set from csv\n",
    "m = \n",
    "df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
    "          sep='|',\n",
    "          usecols=range(1,13), # drop first column 'unnamed'\n",
    "          encoding='utf-8',\n",
    "          quoting=csv.QUOTE_NONNUMERIC,\n",
    "          quotechar='\\'')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We build a classification model and check if it is possible to label articles automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# MNB: starting label propagation\n",
      "# BOW: extracting all words from articles...\n",
      "\n",
      "# BOW: making vocabulary of data set...\n",
      "\n",
      "# BOW: vocabulary consists of 14414 features.\n",
      "\n",
      "# MNB: fit training data and calculate matrix...\n",
      "\n",
      "# BOW: calculating matrix...\n",
      "\n",
      "# BOW: calculating frequencies...\n",
      "\n",
      "# MNB: transform testing data to matrix...\n",
      "\n",
      "# BOW: extracting all words from articles...\n",
      "\n",
      "# BOW: calculating matrix...\n",
      "\n",
      "# BOW: calculating frequencies...\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Anne\\Anaconda3\\lib\\site-packages\\sklearn\\semi_supervised\\label_propagation.py:205: RuntimeWarning: invalid value encountered in true_divide\n",
      "  probabilities /= normalizer\n",
      "C:\\Users\\Anne\\Anaconda3\\lib\\site-packages\\sklearn\\semi_supervised\\label_propagation.py:205: RuntimeWarning: invalid value encountered in true_divide\n",
      "  probabilities /= normalizer\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# MNB: ending label propagation\n",
      "Wall time: 41min 56s\n"
     ]
    }
   ],
   "source": [
    "# use sklearn's CountVectorizer\n",
    "cv = False\n",
    "\n",
    "# call script with manually labeled and manually unlabeled samples\n",
    "%time class_probs, predictions = LabelPropagation.propagate_labels(df.loc[df['Label'] != -1], df.loc[df['Label'] == -1], cv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We label each article with class $j$, if its estimated probability for class $j$ is higher than our threshold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [ 1.  0.  0.]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]\n",
      " [nan nan nan]]\n",
      "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(class_probs[:100])\n",
    "print(predictions[:100])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only labels with this minimum probability are adopted\n",
    "threshold = 0.99\n",
    "# dict for counting estimated labels\n",
    "estimated_labels = {0:0, 1:0, 2:0}\n",
    "\n",
    "# series of indices of recently estimated articles \n",
    "indices_estimated = df.loc[df['Label'] == -1, 'Index'].tolist()\n",
    "\n",
    "# for every row i and every element j in row i\n",
    "for (i,j), value in np.ndenumerate(class_probs):\n",
    "    # check if probability of class i is not less than threshold\n",
    "    if class_probs[i][j] > threshold:\n",
    "        index = indices_estimated[i]\n",
    "        # save estimated label\n",
    "        df.loc[index, 'Estimated'] = classes[j]\n",
    "        # annotate probability\n",
    "        df.loc[index, 'Probability'] = value\n",
    "        # count labels\n",
    "        estimated_labels[int(classes[j])] += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Number of auto-labeled samples in round {}: {}'.format(m, sum(estimated_labels.values())))\n",
    "print('Estimated labels: {}'.format(estimated_labels))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# THIS CELL IS OPTIONAL\n",
    "# let the Naive Bayes Algorithm test the quality of data set's labels\n",
    "\n",
    "# split data into text and label set\n",
    "X = df.loc[df['Label'] != -1, 'Title'] + '. ' + df.loc[df['Label'] != -1, 'Text']\n",
    "X = X.reset_index(drop=True)\n",
    "y = df.loc[df['Label'] != -1, 'Label']\n",
    "y = y.reset_index(drop=True)\n",
    "\n",
    "# use sklearn's CountVectorizer\n",
    "cv = False\n",
    "\n",
    "# call script with manually labeled and manually unlabeled samples\n",
    "#%time MNBInteractive.measure_mnb(X, y, cv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('End of this round (no. {}):'.format(m))\n",
    "print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
    "print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save this round to csv\n",
    "df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
    "      sep='|',\n",
    "      mode='w',\n",
    "      encoding='utf-8',\n",
    "      quoting=csv.QUOTE_NONNUMERIC,\n",
    "      quotechar='\\'')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NOW PLEASE CONTINUE WITH PART II.\n",
    "REPEAT UNTIL ALL SAMPLES ARE LABELED."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/src/2019-02-06-al-labeling-analysis.ipynb
+++ b/src/2019-02-06-al-labeling-analysis.ipynb
@ -0,0 +1,267 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jupyter Notebook for Interactive Labeling\n",
    "______\n",
    "\n",
    "This Jupyter Notebook is only for data analysis.\n",
    "__________\n",
    "Version: 2019-02-06, Anne Lorenz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from BagOfWords import BagOfWords\n",
    "from MNBInteractive import MNBInteractive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
    "In the following, we limit the number of occurences of a certain company name in all labeled articles to 3 to avoid imbalance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PLEASE INSERT M MANUALLY:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = 9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last round number: 9\n",
      "Number of manually labeled articles: 1000\n",
      "Number of manually unlabeled articles: 9000\n"
     ]
    }
   ],
   "source": [
    "# read current data set from csv\n",
    "df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
    "          sep='|',\n",
    "          usecols=range(1,13), # drop first column 'unnamed'\n",
    "          encoding='utf-8',\n",
    "          quoting=csv.QUOTE_NONNUMERIC,\n",
    "          quotechar='\\'')\n",
    "\n",
    "# find current iteration/round number\n",
    "m = int(df['Round'].max())\n",
    "print('Last round number: {}'.format(m))\n",
    "print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
    "print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we apply Multinomial Naive Bayes to calculate the resubstitution error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_test = df.loc[df['Label'] != -1, 'Title'] + ' ' + df.loc[df['Label'] != -1, 'Text']\n",
    "y_train_test = df.loc[df['Label'] != -1, 'Label']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# BOW: extracting all words from articles...\n",
      "\n",
      "# BOW: making vocabulary of data set...\n",
      "\n",
      "# BOW: vocabulary consists of 14414 features.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# use my own BagOfWords python implementation\n",
    "stemming = True\n",
    "rel_freq = True\n",
    "extracted_words = BagOfWords.extract_all_words(X_train_test)\n",
    "vocab = BagOfWords.make_vocab(extracted_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit the training data and return the matrix\n",
    "training_data = BagOfWords.make_matrix(extracted_words, vocab, rel_freq, stemming)\n",
    "testing_data = training_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Naive Bayes\n",
    "classifier = MultinomialNB(alpha=1.0e-10, fit_prior=False, class_prior=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit classifier\n",
    "classifier.fit(training_data, y_train_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# BOW: extracting all words from articles...\n",
      "\n",
      "# BOW: making vocabulary of data set...\n",
      "\n",
      "# BOW: vocabulary consists of 14414 features.\n",
      "\n",
      "# BOW: calculating matrix...\n",
      "\n",
      "# BOW: calculating frequencies...\n",
      "\n",
      "Errors at index:\n",
      "\n"
     ]
    },
    {
     "ename": "KeyError",
     "evalue": "0",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[1;32m<timed eval>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n",
      "\u001b[1;32m~\\BA\\Python\\src\\MNBInteractive.py\u001b[0m in \u001b[0;36manalyze_errors\u001b[1;34m(dataset, sklearn_cv)\u001b[0m\n\u001b[0;32m    252\u001b[0m                 \u001b[0mn\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    253\u001b[0m                 \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_train_test\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 254\u001b[1;33m                         \u001b[1;32mif\u001b[0m \u001b[0my_train_test\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[0mpredictions\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    255\u001b[0m                                 \u001b[0mn\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    256\u001b[0m                                 \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'error no.{}'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mn\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\series.py\u001b[0m in \u001b[0;36m__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m    765\u001b[0m         \u001b[0mkey\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcom\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_apply_if_callable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    766\u001b[0m         \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 767\u001b[1;33m             \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    768\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    769\u001b[0m             \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\base.py\u001b[0m in \u001b[0;36mget_value\u001b[1;34m(self, series, key)\u001b[0m\n\u001b[0;32m   3116\u001b[0m         \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   3117\u001b[0m             return self._engine.get_value(s, k,\n\u001b[1;32m-> 3118\u001b[1;33m                                           tz=getattr(series.dtype, 'tz', None))\n\u001b[0m\u001b[0;32m   3119\u001b[0m         \u001b[1;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me1\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   3120\u001b[0m             \u001b[1;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m>\u001b[0m \u001b[1;36m0\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minferred_type\u001b[0m \u001b[1;32min\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m'integer'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'boolean'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mpandas\\_libs\\index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_value\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;32mpandas\\_libs\\index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_value\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;32mpandas\\_libs\\index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.Int64HashTable.get_item\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.Int64HashTable.get_item\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;31mKeyError\u001b[0m: 0"
     ]
    }
   ],
   "source": [
    "# Predict class\n",
    "predictions = classifier.predict(testing_data)\n",
    "print('Errors at index:')\n",
    "print()\n",
    "n = 0\n",
    "for i in range(len(y_train_test)):\n",
    "    if y_train_test[i] != predictions[i]:\n",
    "        n += 1\n",
    "        print('error no.{}'.format(n))\n",
    "        print('prediction at index {} is: {}, but actual is: {}'\n",
    "        .format(i, predictions[i], y_train_test[i]))\n",
    "        print(X_train_test[i])\n",
    "        print(y_train_test[i])\n",
    "        print()\n",
    "#print metrics\n",
    "print('F1 score: ', format(f1_score(y_train_test, predictions)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# THIS CELL IS OPTIONAL\n",
    "# let the Naive Bayes Algorithm test the quality of data set's labels\n",
    "\n",
    "# split data into text and label set\n",
    "X = df.loc[df['Label'] != -1, 'Title'] + '. ' + df.loc[df['Label'] != -1, 'Text']\n",
    "X = X.reset_index(drop=True)\n",
    "y = df.loc[df['Label'] != -1, 'Label']\n",
    "y = y.reset_index(drop=True)\n",
    "\n",
    "# use sklearn's CountVectorizer\n",
    "cv = False\n",
    "\n",
    "# call script with manually labeled and manually unlabeled samples\n",
    "%time MNBInteractive.measure_mnb(X, y, cv)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/src/BagOfWords.py
+++ b/src/BagOfWords.py
@ -53,7 +53,7 @@ class BagOfWords:
        for word in words:
            word = word.lower()
            # check if alphabetic and not stop word
-            if (word.isalpha() and word not in stop_words):
+            if (word.isalpha()):# and word not in stop_words):
                if stemming:
                    # reduce word to its stem
                    word = stemmer.stem(word)
--- a/src/LabelPropagation.py
+++ b/src/LabelPropagation.py
@ -0,0 +1,86 @@
 '''
 Label Propagation Algorithm for Interactive Labeling
 ====================================================
 Uses scikit learn's implementation of label propagation:
 Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled
 data with label propagation.
 (Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.)
 Prints out probabilities for classes needed for interactive labeling.
 '''
 from BagOfWords import BagOfWords
 import pandas as pd
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.metrics import recall_score, precision_score
 from sklearn.semi_supervised import label_propagation
 class LabelPropagation:
 	def propagate_labels(labeled_data, unlabeled_data, sklearn_cv=False):
 		print('# MNB: starting label propagation')
 		# assign algorithm
 		classifier = label_propagation.LabelSpreading()
 		# split labeled data into text and label set
 		# join title and text
 		X = labeled_data['Title'] + '. ' + labeled_data['Text']
 		y = labeled_data['Label']
 		# split unlabeled data into text and label set
 		# join title and text
 		U = unlabeled_data['Title'] + '. ' + unlabeled_data['Text']
 		l = unlabeled_data['Label']
 		if sklearn_cv:
 				cv = CountVectorizer()
 		# probabilities of each class (of each fold)
 		class_probs = []
 		# number of training samples observed in each class 
 		class_counts = []
 		if sklearn_cv:
 			# fit the training data and then return the matrix
 			training_data = cv.fit_transform(X, y).toarray()
 			# transform testing data and return the matrix
 			testing_data = cv.transform(U).toarray()
 		else:
 			# use my own BagOfWords python implementation
 			stemming = True
 			rel_freq = False
 			extracted_words = BagOfWords.extract_all_words(X)
 			vocab = BagOfWords.make_vocab(extracted_words)
 			# fit the training data and then return the matrix
 			print('# MNB: fit training data and calculate matrix...')
 			print()
 			training_data = BagOfWords.make_matrix(extracted_words,
 								 vocab, rel_freq, stemming)
 			# transform testing data and return the matrix
 			print('# MNB: transform testing data to matrix...')
 			print()
 			extracted_words = BagOfWords.extract_all_words(U)
 			testing_data = BagOfWords.make_matrix(extracted_words,
 								 vocab, rel_freq, stemming)
 		#fit classifier
 		classifier.fit(training_data, y)
 		# probability estimates for the test vector (testing_data)
 		class_probs = classifier.predict_proba(testing_data)
 		predictions = classifier.predict(testing_data)
 		print('# MNB: ending label propagation')
 		# return vector of class estimates
 		return class_probs, predictions
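A note on how the returned `class_probs` might be consumed: the notebook's introduction states that an estimated label is adopted automatically when its class probability exceeds 0.99. The following is a minimal sketch of that selection step, not the notebook's actual code. The helper name `select_confident` and its parameters are hypothetical, and `classes` is assumed to list the class labels in the same column order as `class_probs` (for a fitted scikit-learn classifier this would typically be its `classes_` attribute).

```python
def select_confident(class_probs, classes, threshold=0.99):
    """Return (sample_position, label) pairs whose top probability exceeds threshold.

    class_probs: one row of class probabilities per sample
    classes:     class labels in the column order of class_probs (assumed)
    """
    selected = []
    for pos, probs in enumerate(class_probs):
        # position of the most probable class for this sample
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] > threshold:
            selected.append((pos, classes[best]))
    return selected
```

Samples that are not returned would stay in the pool for manual labeling in the next round.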
--- a/src/LabelingPlotter.py
+++ b/src/LabelingPlotter.py
@ -0,0 +1,42 @@
 import matplotlib
 import matplotlib.pyplot as plt
 import numpy as np
 # round numbers
 round = [0,1,2,3,4,5,6,7,8,9]
 # number of wrong estimated labels per round
 wrong = [0/100, 19/100, 17/100, 16/100, 20/100, 12/100, 10/100, 20/100, 14/100, 12/100]
 # number of manual classified articles per class and round 
 man_0 = [84, 165, 247, 329, 410, 498, 586, 662, 741, 821]
 man_1 = [3, 7, 12, 16, 20, 22, 23, 29, 37, 39]
 man_2 = [13, 28, 41, 55, 70, 80, 91, 109, 122, 140]
 # number of estimated labels per class and round
 est_0 = [9873/9900, 9757/9800, 9603/9700, 9470/9600, 9735/9500, 9238/9400, 9107/9300, 8007/9200, 8064/9100, 7641/9000]
 est_1 = [14/9900, 15/9800, 11/9700, 11/9600, 16/9500, 17/9400, 18/9300, 19/9200, 18/9100, 20/9000]
 est_2 = [12/9900, 26/9800, 77/9700, 94/9600, 380/9500, 123/9400, 147/9300, 676/9200, 595/9100, 837/9000]
 fig, ax = plt.subplots(3, 1)
 ax[0].plot(round, wrong)
 ax[0].set_xlabel('# round')
 ax[0].set_ylabel('# false rate')
 ax[1].plot(round, man_0, round, man_1, round, man_2)
 ax[1].set_ylabel('# manually labeled')
 ax[2].plot(round, est_0, round, est_1, round, est_2)
 ax[2].set_ylabel('# estimated articles')
 fig.tight_layout()
 #plt.savefig('..\\visualization\\Labeling_1.png')
 plt.show()
 #cxy, f = axs[1].cohere(s1, s2, 256, 1. / dt)
 # format axis labels for thousands (e.g. '10,000')
 #plt.gca().yaxis.set_major_formatter(matplotlib.ticker\
 	#.FuncFormatter(lambda x, p: format(int(x), ',')))
--- a/src/MNBInteractive.py
+++ b/src/MNBInteractive.py
@ -16,196 +16,248 @@ from sklearn.naive_bayes import MultinomialNB
 class MNBInteractive:
-    '''NOTE: The multinomial distribution normally requires integer feature counts.
+	'''NOTE: The multinomial distribution normally requires integer feature counts.
-    However, in practice, fractional counts such as tf-idf may also work.
+	However, in practice, fractional counts such as tf-idf may also work.
-    '''
+	'''
-    def estimate_mnb(labeled_data, unlabeled_data, sklearn_cv=False):
+	def estimate_mnb(labeled_data, unlabeled_data, sklearn_cv=False):
-        '''fits naive bayes model
+		'''fits naive bayes model
-        '''
+		'''
-        print('# MNB: starting multinomial naives bayes...')
+		print('# MNB: starting multinomial naives bayes...')
-        print()
+		print()
-        # split labeled data into text and label set
+		# split labeled data into text and label set
-        # join title and text
+		# join title and text
-        X = labeled_data['Title'] + '. ' + labeled_data['Text']
+		X = labeled_data['Title'] + '. ' + labeled_data['Text']
-        y = labeled_data['Label']
+		y = labeled_data['Label']
-        # split unlabeled data into text and label set
+		# split unlabeled data into text and label set
-        # join title and text
+		# join title and text
-        U = unlabeled_data['Title'] + '. ' + unlabeled_data['Text']
+		U = unlabeled_data['Title'] + '. ' + unlabeled_data['Text']
-        l = unlabeled_data['Label']
+		l = unlabeled_data['Label']
-        if sklearn_cv:
+		if sklearn_cv:
-            cv = CountVectorizer()
+			cv = CountVectorizer()
-        # fit_prior=False: a uniform prior will be used instead
+		# fit_prior=False: a uniform prior will be used instead
-        # of learning class prior probabilities
+		# of learning class prior probabilities
-        classifier = MultinomialNB(alpha=1.0e-10,
+		classifier = MultinomialNB(alpha=1.0e-10,
-                                   fit_prior=False,
+								   fit_prior=False,
-                                   class_prior=None)
+								   class_prior=None)
-        # metrics
+		# metrics
-        recall_scores = []
+		recall_scores = []
-        precision_scores = []
+		precision_scores = []
-        f1_scores = []
+		f1_scores = []
-        # probabilities of each class (of each fold)
+		# probabilities of each class (of each fold)
-        class_probs = []
+		class_probs = []
-        # number of training samples observed in each class 
+		# number of training samples observed in each class 
-        class_counts = []
+		class_counts = []
-        if sklearn_cv:
+		if sklearn_cv:
-            # use sklearn CountVectorizer
+			# use sklearn CountVectorizer
-            # fit the training data and then return the matrix
+			# fit the training data and then return the matrix
-            training_data = cv.fit_transform(X, y).toarray()
+			training_data = cv.fit_transform(X, y).toarray()
-            # transform testing data and return the matrix
+			# transform testing data and return the matrix
-            testing_data = cv.transform(U).toarray()
+			testing_data = cv.transform(U).toarray()
-        else:
+		else:
-            # use my own BagOfWords python implementation
+			# use my own BagOfWords python implementation
-            stemming = True
+			stemming = True
-            rel_freq = False
+			rel_freq = False
-            extracted_words = BagOfWords.extract_all_words(X)
+			extracted_words = BagOfWords.extract_all_words(X)
-            vocab = BagOfWords.make_vocab(extracted_words)
+			vocab = BagOfWords.make_vocab(extracted_words)
-            # fit the training data and then return the matrix
+			# fit the training data and then return the matrix
-            print('# MNB: fit training data and calculate matrix...')
+			print('# MNB: fit training data and calculate matrix...')
-            print()
+			print()
-            training_data = BagOfWords.make_matrix(extracted_words,
+			training_data = BagOfWords.make_matrix(extracted_words,
-                            vocab, rel_freq, stemming)
+							vocab, rel_freq, stemming)
-            # transform testing data and return the matrix
+			# transform testing data and return the matrix
-            print('# MNB: transform testing data to matrix...')
+			print('# MNB: transform testing data to matrix...')
-            print()
+			print()
-            extracted_words = BagOfWords.extract_all_words(U)
+			extracted_words = BagOfWords.extract_all_words(U)
-            testing_data = BagOfWords.make_matrix(extracted_words,
+			testing_data = BagOfWords.make_matrix(extracted_words,
-                            vocab, rel_freq, stemming)
+							vocab, rel_freq, stemming)
-        #fit classifier
+		#fit classifier
-        classifier.fit(training_data, y)
+		classifier.fit(training_data, y)
-        
+		
-        # probability estimates for the test vector (testing_data)
+		# probability estimates for the test vector (testing_data)
-        class_probs = classifier.predict_proba(testing_data)
+		class_probs = classifier.predict_proba(testing_data)
-        # number of samples encountered for each class during fitting
+		# number of samples encountered for each class during fitting
-        # this value is weighted by the sample weight when provided
+		# this value is weighted by the sample weight when provided
-        class_count = classifier.class_count_
+		class_count = classifier.class_count_
-        # classes in order used
+		# classes in order used
-        classes = classifier.classes_
+		classes = classifier.classes_
-        print('# MNB: ending multinomial naive bayes')
+		print('# MNB: ending multinomial naive bayes')
-        # return classes and vector of class estimates
+		# return classes and vector of class estimates
-        return classes, class_count, class_probs
+		return classes, class_count, class_probs
-        
+		
-    def measure_mnb(X, y, sklearn_cv=False, percentile=100):
+	def measure_mnb(X, y, sklearn_cv=False, percentile=100):
-        '''fits multinomial naive bayes model
+		'''fits multinomial naive bayes model
-        '''
+		'''
-        print('# fitting model')
+		print('# fitting model')
-        print('# ...')
+		print('# ...')
-        if sklearn_cv:
+		if sklearn_cv:
-            cv = CountVectorizer()
+			cv = CountVectorizer()
-        # use stratified k-fold cross-validation as split method
+		# use stratified k-fold cross-validation as split method
-        skf = StratifiedKFold(n_splits = 2, shuffle=True, random_state=5)
+		skf = StratifiedKFold(n_splits = 2, shuffle=True, random_state=5)
-        classifier = MultinomialNB(alpha=1.0e-10,
+		classifier = MultinomialNB(alpha=1.0e-10,
-                                   fit_prior=False,
+								   fit_prior=False,
-                                   class_prior=None)
+								   class_prior=None)
-        # metrics
+		# metrics
-        recall_scores = []
+		recall_scores = []
-        precision_scores = []
+		precision_scores = []
-        f1_scores = []
+		f1_scores = []
-        # probabilities of each class (of each fold)
+		# probabilities of each class (of each fold)
-        class_prob = []
+		class_prob = []
-        # counts number of training samples observed in each class 
+		# counts number of training samples observed in each class 
-        class_counts = []
+		class_counts = []
-        # for each fold
+		# for each fold
-        n = 0
+		n = 0
-        for train, test in skf.split(X,y):
+		for train, test in skf.split(X,y):
-            n += 1
+			n += 1
-            print('# split no. ' + str(n))
+			print('# split no. ' + str(n))
-            if sklearn_cv:
+			if sklearn_cv:
-                # use sklearn CountVectorizer
+				# use sklearn CountVectorizer
-                # fit the training data and then return the matrix
+				# fit the training data and then return the matrix
-                training_data = cv.fit_transform(X[train], y[train]).toarray()
+				training_data = cv.fit_transform(X[train], y[train]).toarray()
-                # transform testing data and return the matrix
+				# transform testing data and return the matrix
-                testing_data = cv.transform(X[test]).toarray()
+				testing_data = cv.transform(X[test]).toarray()
-            else:
+			else:
-                # use my own BagOfWords python implementation
+				# use my own BagOfWords python implementation
-                stemming = True
+				stemming = True
-                rel_freq = True
+				rel_freq = True
-                extracted_words = BagOfWords.extract_all_words(X[train])
+				extracted_words = BagOfWords.extract_all_words(X[train])
-                vocab = BagOfWords.make_vocab(extracted_words)
+				vocab = BagOfWords.make_vocab(extracted_words)
-                # fit the training data and then return the matrix
+				# fit the training data and then return the matrix
-                training_data = BagOfWords.make_matrix(extracted_words,
+				training_data = BagOfWords.make_matrix(extracted_words,
-                                vocab, rel_freq, stemming)
+								vocab, rel_freq, stemming)
-                # transform testing data and return the matrix
+				# transform testing data and return the matrix
-                extracted_words = BagOfWords.extract_all_words(X[test])
+				extracted_words = BagOfWords.extract_all_words(X[test])
-                testing_data = BagOfWords.make_matrix(extracted_words,
+				testing_data = BagOfWords.make_matrix(extracted_words,
-                                vocab, rel_freq, stemming)
+								vocab, rel_freq, stemming)
-            # apply select percentile
+			# apply select percentile
-            selector = SelectPercentile(percentile=percentile)
+			selector = SelectPercentile(percentile=percentile)
-            selector.fit(training_data, y[train])
+			selector.fit(training_data, y[train])
-            # new reduced data sets
+			# new reduced data sets
-            training_data_r = selector.transform(training_data)
+			training_data_r = selector.transform(training_data)
-            testing_data_r = selector.transform(testing_data)
+			testing_data_r = selector.transform(testing_data)
-            #fit classifier
+			#fit classifier
-            classifier.fit(training_data_r, y[train])
+			classifier.fit(training_data_r, y[train])
-            #predict class
+			#predict class
-            predictions_train = classifier.predict(training_data_r)
+			predictions_train = classifier.predict(training_data_r)
-            predictions_test = classifier.predict(testing_data_r)
+			predictions_test = classifier.predict(testing_data_r)
-            #print and store metrics
+			#print and store metrics
-            rec = recall_score(y[test], predictions_test, average='macro')
+			rec = recall_score(y[test], predictions_test)
-            print('rec: ' + str(rec))
+			print('rec: ' + str(rec))
-            recall_scores.append(rec)
+			recall_scores.append(rec)
-            prec = precision_score(y[test], predictions_test, average='macro')
+			prec = precision_score(y[test], predictions_test)
-            print('prec: ' + str(prec))
+			print('prec: ' + str(prec))
-            print('#')
+			print('#')
-            precision_scores.append(prec)
+			precision_scores.append(prec)
-            # equation for f1 score
+			# equation for f1 score
-            f1_scores.append(2 * (prec * rec)/(prec + rec))
+			f1_scores.append(2 * (prec * rec)/(prec + rec))
-            #class_prob.append(classifier.class_prior_)
+			#class_prob.append(classifier.class_prior_)
-            #class_counts.append(classifier.class_count_)
+			#class_counts.append(classifier.class_count_)
-        ##########################
+		##########################
-        #print metrics of test set
+		#print metrics of test set
-        print('-------------------------')
+		print('-------------------------')
-        print('prediction of testing set:')
+		print('prediction of testing set:')
-        print('Precision score: min = {}, max = {}, average = {}'
+		print('Precision score: min = {}, max = {}, average = {}'
-                .format(min(precision_scores),
+				.format(min(precision_scores),
-                        max(precision_scores),
+						max(precision_scores),
-                        sum(precision_scores)/float(len(precision_scores))))
+						sum(precision_scores)/float(len(precision_scores))))
-        print('Recall score: min = {}, max = {}, average = {}'
+		print('Recall score: min = {}, max = {}, average = {}'
-                .format(min(recall_scores),
+				.format(min(recall_scores),
-                        max(recall_scores),
+						max(recall_scores),
-                        sum(recall_scores)/float(len(recall_scores))))
+						sum(recall_scores)/float(len(recall_scores))))
-        print('F1 score: min = {}, max = {}, average = {}'
+		print('F1 score: min = {}, max = {}, average = {}'
-                .format(min(f1_scores),
+				.format(min(f1_scores),
-                        max(f1_scores),
+						max(f1_scores),
-                        sum(f1_scores)/float(len(f1_scores))))
+						sum(f1_scores)/float(len(f1_scores))))
-        # print()
+		# print()
-        # # print probability of each class
+		# # print probability of each class
-        # print('probability of each class:')
+		# print('probability of each class:')
-        # print()
+		# print()
-        # #print(class_prob)
+		# #print(class_prob)
-        # print()
+		# print()
-        # print('number of samples of each class:')
+		# print('number of samples of each class:')
-        # print()
+		# print()
-        # #print(class_counts)
+		# #print(class_counts)
-        # print()
+		# print()
 ######## only needed for resubstitution error ########
 	def analyze_errors(dataset, sklearn_cv):
 		'''calculates resubstitution error
 		shows indices of false classified articles
 		uses Multinomial Naive Bayes, fitted and evaluated on the same data
 		'''
 		X_train_test = dataset['Title'] + ' ' + dataset['Text']
 		y_train_test = dataset['Label']
 		if sklearn_cv:
 				# use sklearn CountVectorizer
 				cv = CountVectorizer()
 				# fit the training data and then return the matrix
 				training_data = cv.fit_transform(X_train_test, y_train_test).toarray()
 				# transform testing data and return the matrix
 				testing_data = cv.transform(X_train_test).toarray()
 		else:
 			# use my own BagOfWords python implementation
 			stemming = True
 			rel_freq = True
 			extracted_words = BagOfWords.extract_all_words(X_train_test)
 			vocab = BagOfWords.make_vocab(extracted_words)
 			# fit the training data and return the matrix
 			training_data = BagOfWords.make_matrix(extracted_words,
 							vocab, rel_freq, stemming)
 			testing_data = training_data
 		# Naive Bayes
 		classifier = MultinomialNB(alpha=1.0e-10,
 								   fit_prior=False,
 								   class_prior=None)
 		# fit classifier
 		classifier.fit(training_data, y_train_test)
 		# Predict class
 		predictions = classifier.predict(testing_data)
 		print('Errors at index:')
 		print()
 		n = 0
 		for i in range(len(y_train_test)):
 			if y_train_test[i] != predictions[i]:
 				n += 1
 				print('error no.{}'.format(n))
 				print('prediction at index {} is: {}, but actual is: {}'
 				.format(i, predictions[i], y_train_test[i]))
 				print(X_train_test[i])
 				print(y_train_test[i])
 				print()
 		#print metrics
 		print('F1 score: ', format(f1_score(y_train_test, predictions)))
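The `KeyError: 0` traceback recorded in the notebook output appears to come from the same pattern used above: `y_train_test[i]` looks up the index *label* `i` on a pandas Series, which fails when the data set's index does not contain that label (for example after filtering rows without resetting the index). Below is a minimal sketch of a position-based comparison under that assumption; `find_misclassified` is a hypothetical helper, not part of the repository.

```python
def find_misclassified(actual, predicted):
    """Compare labels by position and return (position, predicted, actual) triples."""
    errors = []
    # zip pairs the values in iteration order, so index labels are never looked up
    for pos, (act, pred) in enumerate(zip(actual, predicted)):
        if act != pred:
            errors.append((pos, pred, act))
    return errors
```

Because iterating over a pandas Series yields its values in order, the same function can be called with `y_train_test` and `predictions` directly.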
--- a/thesis/LV.bib
+++ b/thesis/LV.bib
@ -1,17 +0,0 @@
@BOOK{pierson2016,
 	AUTHOR="Lillian Pierson",
 	TITLE="Data Science für Dummies",
 	PUBLISHER="WILEY-VCH Verlag GmbH \& Co. KGaA",
 	YEAR=2016
 }
 #stanford NER:
@PAPER{finkel2005,
        AUTHOR="Jenny Rose Finkel, Trond Grenager, Christopher Manning",
        TITLE="Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics",
        PUBLISHER="ACL",
        YEAR=2005
        }
 # pp. 363-370. #http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
 # webhose.io Dokumentation:
 # https://docs.webhose.io/docs/output-reference
--- a/thesis/UHH-Logo_2010_Farbe_CMYK.pdf
+++ b/thesis/UHH-Logo_2010_Farbe_CMYK.pdf
--- a/thesis/images/Data_Processing_Pipeline_251018.png
+++ b/thesis/images/Data_Processing_Pipeline_251018.png
--- a/thesis/images/Hist_10CommonWords_100rows_2.png
+++ b/thesis/images/Hist_10CommonWords_100rows_2.png
--- a/thesis/images/NER_old_50bins.png
+++ b/thesis/images/NER_old_50bins.png
--- a/thesis/images/WordCloud_allRows_best.png
+++ b/thesis/images/WordCloud_allRows_best.png
--- a/thesis/images/art_length_200bins_best.png
+++ b/thesis/images/art_length_200bins_best.png
--- a/thesis/refs.bib
+++ b/thesis/refs.bib
@ -1,23 +0,0 @@
@BOOK{BOOK:1,
 	  AUTHOR="Lillian Pierson",
 	  TITLE="Data Science für Dummies",
 	  PUBLISHER="Wiley-VCH Verlag GmbH \& Co. KGaA",
 	  YEAR=2016
 }
 #stanford NER:
@ARTICLE{ARTICLE:1,
         AUTHOR="Jenny Rose Finkel, Trond Grenager, Christopher Manning",
         TITLE="Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics",
         JOURNAL="ACL",
         PUBLISHER="ACL",
         YEAR=2005}
 # pp. 363-370. #http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
@MISC{WEBSITE:1,
  	  HOWPUBLISHED="\url{https://docs.webhose.io/docs/output-reference}",
 	  AUTHOR = "Intel",
 	  TITLE = "Webhose.io",
 	  MONTH = "Oct",
 	  YEAR = "1999",
 	  NOTE = "Accessed on 2018-10-19"
 }
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@ -1,688 +0,0 @@
 \documentclass[11pt,a4paper]{scrbook}
 \usepackage{geometry}	
 \usepackage[utf8]{inputenc}
 \usepackage[T1]{fontenc}
 \usepackage[pdftex]{graphicx}
 %package to manage images
 \usepackage{graphicx} 
 \graphicspath{ {./images/} }
 \usepackage[rightcaption]{sidecap}
 \usepackage{wrapfig}
 %for lists
 \usepackage{listings}
 \usepackage{enumitem}
 \usepackage{colortbl}	
 \usepackage{xcolor}
 \usepackage{soul}
 \usepackage{cleveref}
 \usepackage{todonotes}
 %for hyperlinks
 \usepackage{hyperref}
 \AtBeginDocument{\renewcommand{\chaptername}{}}
 % Comments from Julian
 \newcommand{\jk}[1]{\todo[inline]{JK: #1}}
 \renewcommand{\familydefault}{\sfdefault}
 % Questions from Anne
 \definecolor{comments}{cmyk}{1,0,1,0}
 \newcommand{\al}[1]{\todo[inline]{\color{comments}{Anne: #1}}}
 \definecolor{uhhred}{cmyk}{0,100,100,0}
 \begin{document}
 \frontmatter
 \newgeometry{centering,left=2cm,right=2cm,top=2cm,bottom=2cm}
 \begin{titlepage}
 \includegraphics[scale=0.3]{UHH-Logo_2010_Farbe_CMYK.pdf}
 \vspace*{2cm}
 \Large
 \begin{center} 
      {\color{uhhred}\textbf{\so{BACHELORTHESIS}}}
 \vspace*{2.0cm}\\
 {\LARGE \textbf{Prediction of Company Mergers\\Using Interactive Labeling\\and Machine Learning Methods}}
 % OR: Incremental labeling of an unknown data set using the example of classification of news articles OR
 % OR: Recognizing M\&As in News Articles\\Using Interactive Labeling\\and Machine Learning Methods
 % OR: Interactive Labeling of Unclassified Data\\Using the Example of Recognition of Company Mergers
 \vspace*{2.0cm}\\
 submitted by
 \vspace*{0.4cm}\\
 Anne Lorenz
 \end{center}
 \vspace*{3.5cm}
 \noindent 
 MIN Faculty \vspace*{0.4cm} \\ 
 Department of Informatics \vspace*{0.4cm} \\ 
 Degree program: Software-System-Entwicklung \vspace*{0.4cm} \\ 
 Matriculation number: 6434073 \vspace*{0.8cm} \\ 
 First reviewer: Dr. Julian Kunkel \vspace*{0.4cm} \\ 
 Second reviewer: Eugen Betke
 \vspace*{0.8cm} \\
 Supervisors: Dr. Julian Kunkel, Doris Birkefeld
 \end{titlepage}
 \restoregeometry
 \chapter*{Abstract}
 BLABLA ABSTRACT
 %As objective, short, comprehensible, complete and precise as possible :-)
 \tableofcontents
 \mainmatter 
 %Chapter 1 Introduction
 %####################
 \chapter{Introduction} 
 \label{chap:introduction}
 \textit{
 In this chapter...In Section \ref{sec:motivation} the motivation, then in Section \ref{sec:goals} the goals...
 }
 \section{Motivation} 
 \label{sec:motivation}
 Every classification problem first requires a labeled data set before a machine learning model can be trained and used to make predictions. In general, the larger the labeled data set, the better the predictions. To obtain such a data set, however, each data element must first be classified manually. Depending on the type of data, this procedure can be very time-consuming, for example if longer texts have to be read.
 In this thesis we present an alternative data labeling method that makes it possible to label a larger amount of data in a shorter time. 
 \section{Goals} 
 \label{sec:goals}
 \jk{One sentence that describes the problem, then broken down into subtasks}
 We want to compare a conventional method of data labeling with an alternative, incremental method using the following example: the aim is to investigate news articles about recent mergers ('mergers and acquisitions') and to classify them accordingly. With the help of the labeled data set, different classification models will be applied and optimized so that predictions about future news articles become possible. 
 \section{Outline}
 % the outline of the thesis goes here...
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline In this chapter we discussed ... The following chapter deals with blabla.}
 %Chapter 2 State of the Art
 %##########################
 \chapter{State of the Art} 
 \label{state_of_the_art}
 \textit{In this chapter the current state of research in the field of... will be presented.
 }
 \section{State of Research}
 \al{What should go in here? Unfortunately I cannot picture what is meant by this.}
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline In this chapter we have described ... are described in the next chapter. In the next chapter we describe...
 }
 %Chapter 3 Background
 %#################### 
 \chapter{Background and Related Work} 
 \label{chap:background}
 \textit{
 In this chapter...In Section \ref{sec:news} news sources are introduced...
 }
 \section{Business News about Mergers} 
 \label{sec:news}
 \subsection{Company Mergers} 
 When two companies merge, ... When shares of a company are sold,...
 \subsection{Webhose.io as Source for News Articles} 
 As a source for our initial data set, RSS feeds from established business news agencies such as \textit{Reuters} or \textit{Bloomberg} come into consideration. However, when crawling RSS feeds, it is not possible to retrieve news from a longer period in the past. Since we want to analyze news from a period of 12 months, we obtain the data set from the provider \textit{webhose.io}\footnote{\url{https://webhose.io/}}. It offers access to English news articles from sections like \textit{Financial News}, \textit{Finance} and \textit{Business} at affordable fees compared to the news agencies' offers. As we are only interested in reliable sources, we limit our request to the websites of the news agencies \textit{Reuters, Bloomberg, Financial Times, CNN, The Economist} and \textit{The Guardian}. 
 \section{Supervised Machine Learning Problems} 
 \subsubsection{Structured and Unstructured Data} 
 \subsection{Classification Problems}
 \subsubsection{Binary Classification}
 Comparable to spam filtering...
 \subsubsection{Multiple Classification}
 \subsection{Balanced / Unbalanced Data Set}
 \section{Text Analysis} 
 \subsection{Natural Language Processing (NLP)} 
 \subsection{Tokenization}
 \subsection{Unigram, Bigram} 
 \subsection{Stemming} 
 \subsection{Feature Vectors, Document Term Matrix}
 \subsubsection{Word Frequencies} 
 \subsection{Bag of Words (BOW)} 
 \subsection{Stop Words} 
 \subsection{Named Entity Recognition (NER)} 
 \section{Machine Learning Models} 
 \subsection{Naive Bayes Classifier} 
 \subsection{Support Vector Machines Classifier (SVM)} 
 \subsection{Decision Trees Classifier} 
 \section{Tuning Options} 
 \subsection{Split Methods} 
 \subsubsection{Test-Train-Split}
 \subsubsection{Shuffle Split}
 \subsubsection{Stratified Split}
 \subsubsection{(K-fold) Cross-Validation}
 \subsection{Hyperparameters} 
 \subsection{Feature Selection}
 \section{Metrics}
 \subsection{Accuracy, Error Rate, Sensitivity, Specificity}
 Sensitivity (= true positive rate) and specificity (= true negative rate)
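
 Writing $TP$, $FP$, $TN$ and $FN$ for the numbers of true positives, false positives, true negatives and false negatives (notation assumed here; it is not introduced elsewhere in this draft), the two rates can be stated as:
 \[
 \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}
 \]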
 \subsection{Recall, Precision, F1-score} 
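 As a sketch of the standard definitions, with $TP$, $FP$ and $FN$ counting true positives, false positives and false negatives (assumed notation):
 \[
 \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
 \]
 The $F_1$ expression corresponds to the formula used in \texttt{MNBInteractive.py}, which computes \texttt{2 * (prec * rec)/(prec + rec)} per fold.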
 \subsection{Robustness}
 \subsection{Overfit, Underfit}
 \subsection{Bias, Variance}
 \subsection{Resubstitution Error}
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In this chapter we ... blabla are described in section bla.
 In the next chapter we describe...
 }
 %Chapter 4 Design
 %###########################
 \chapter{Design} 
 \label{chap:design}
 \textit{
 In this chapter... In Section \ref{sec:overview} we give an overview of all, then in Section the data processing pipeline, blablabla...
 }
 \jk{What has to be done overall, and which subproblems have to be addressed. Discuss alternatives, make decisions based on criteria. This is where your own work goes, not related work or methods that already exist. Those are relevant only if you compare against them.}
 \section{Overview}
 \label{sec:overview}
 % Data Selection > Data Labeling > Data Preprocessing > Model Selection > Recognition of Merger Partners
 \vspace{1.0cm}
 \begin{figure}[h]
 \centering
 \includegraphics[width=\textwidth]{images/Data_Processing_Pipeline_251018}
 \caption{Data Processing Pipeline}
 \label{fig:pipeline}
 \end{figure}
 \vspace{1.0cm}
 As shown in Figure \ref{fig:pipeline}, we first need to select appropriate data, then label a data set manually, then, ...\\
 \\
 \section{Data Selection}
 \label{sec:data_selection}
 \subsection{Downloading the Data}
 Before we can start with the data processing, we have to identify and select appropriate data. We downloaded the news articles of all 12 months of 2017 from the website \textit{webhose.io}.
 To retrieve our data, we make the following request\footnote{The possible filter settings of \textit{webhose.io} are documented at \url{https://docs.webhose.io/docs/filters-reference}.}:\\\\
 \texttt{
 site:(reuters.com OR ft.com OR cnn.com OR economist.com\\
 \noindent\hspace*{12mm}%
 OR bloomberg.com OR theguardian.com)\\
 site\_category:(financial\_news OR finance OR business)\\
 \\
 timeframe:january2017-december2017} \\
 \\
 The requested data was downloaded in September 2018 in JSON format. Every news article is saved in a separate file; in total, 1,478,508 files were downloaded (4.69 GiB).
 Among other fields, each JSON file contains the information shown in the following example:\\
 \begin{lstlisting}[breaklines=true]
 {
      "thread": {
        "uuid": "a931e8221a6a55fac4badd5c6992d0a525ca3e83",
        "url": "https://www.reuters.com/article/us-github-m-a-microsoft-eu/eu-antitrust-ruling-on-microsoft-buy-of-github-due-by-october-19-idUSKCN1LX114",
        "site": "reuters.com",
        "site_section": "http://feeds.reuters.com/reuters/financialsNews",
        "section_title": "Reuters | Financial News",
        "published": "2018-09-17T20:00:00.000+03:00",
        "site_type": "news",
        "spam_score": 0.0
      },
      "title": "EU antitrust ruling on Microsoft buy of GitHub due by October 19",
      "text": "BRUSSELS (Reuters)-EU antitrust regulators will decide by Oct. 19 whether to clear U.S. software giant Microsoft's $7.5 billion dollar acquisition of privately held coding website GitHub. Microsoft, which wants to acquire the firm to reinforce its cloud computing business against rival Amazon, requested European Union approval for the deal last Friday, a filing on the European Commission website showed on Monday. The EU competition enforcer can either give the green light with or without demanding concessions, or it can open a full-scale investigation if it has serious concerns. GitHub, the world's largest code host with more than 28 million developers using its platform, is Microsoft's largest takeover since the company bought LinkedIn for $26 billion in 2016. Microsoft Chief Executive Satya Nadella has tried to assuage users' worries that GitHub might favor Microsoft products over competitors after the deal, saying GitHub would continue to be an open platform that works with all the public clouds. Reporting by Foo Yun Chee; Editing by Edmund Blair",
      "language": "english",
      "crawled": "2018-09-18T01:52:42.035+03:00"
 }
 \end{lstlisting}
 As \textit{webhose.io} is a secondary source for news articles and only crawls the news feeds itself, some RSS feeds may be parsed incorrectly, or an article may be tagged with the wrong topic in its \textit{site categories}. The downloaded files also contain blog entries, user comments, videos, graphical content and other spam, which we have to filter out. We also discard pages that merely quote Reuters or one of the other sources. Moreover, we are only interested in English news articles.\\
 After filtering out all irrelevant data, we obtain a data set of \textbf{41,790} news articles, which we store in multiple CSV files\footnote{All CSV files have a total size of 109 MB.}, one for each month.
 \subsection{Selecting the Working Data Set}
 \label{subsec:data_selection}
 The number of articles differs from month to month. Because we want the items of our initial working data set to be evenly distributed throughout the year, we select 833 articles from each month\footnote{We select 834 from every third month: (8 $\times$ 833) + (4 $\times$ 834) = 10,000.} to create a CSV file containing \textbf{10,000} articles with a total size of 27 MB.
 The CSV file has the following seven columns:
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 Uuid & Title & Text & Site & SiteSection & Url & Timestamp\\
 \hline
 \end{tabular}
 \end{center}
 \begin{itemize}
 \item \textbf{Uuid:} Universally unique identifier, representing the article's thread.
 \item \textbf{Title:} The news article's headline.
 \item \textbf{Text:} The article's plain text.
 \item \textbf{Site:} The domain name of the article's site (e.g. reuters.com).
 \item \textbf{SiteSection:} The link to the section of the site where the thread was created.
 \item \textbf{Url:} The link to the top of the article's thread.
 \item \textbf{Timestamp:} The thread's publishing date and time in the format YYYY-MM-DDThh:mm (GMT+3).
 \end{itemize}
 The columns \textbf{Title} and \textbf{Text} contain our main data, whereas the remaining attributes are metadata.
 We explore this data set in more detail in Chapter \ref{chap:exploration}.
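The monthly selection described in this subsection can be sketched as follows. This is a minimal illustration, not the code used for the thesis: the function names \texttt{monthly\_quota} and \texttt{select\_working\_set} are hypothetical, articles are assumed to be dictionaries keyed by the column names listed above, and "every third month" is assumed to mean March, June, September and December.

```python
import random
from collections import defaultdict


def monthly_quota(month):
    """Return 834 for every third month (assumed: 3, 6, 9, 12), else 833."""
    return 834 if month % 3 == 0 else 833


def select_working_set(articles, seed=0):
    """Draw a random sample of articles per month according to the quota."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_month = defaultdict(list)
    for article in articles:
        # Timestamp format is YYYY-MM-DDThh:mm, so indices 5 and 6 hold the month
        month = int(article["Timestamp"][5:7])
        by_month[month].append(article)

    selected = []
    for month in sorted(by_month):
        pool = by_month[month]
        # Never request more articles than the month actually provides
        quota = min(monthly_quota(month), len(pool))
        selected.extend(rng.sample(pool, quota))
    return selected
```

Sampling without replacement within each month guarantees that no article is selected twice, and the quotas add up to the 10,000 articles of the working data set.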
 \section{Data Labeling}
 Here we explain our two different approaches to labeling data sets.
 \subsection{Conventional Method} 
 \subsubsection{Top-Down / Waterfall}
 \begin{enumerate}[label=(\alph*)]
 \item \textbf{Data Labeling}
 \item \textbf{Data Cleaning}
 \item \textbf{Model Building}
 \item \textbf{Analysis of wrongly predicted instances}\\
 $\Rightarrow$ optionally back to step (a)\footnote{In practice, this step is rarely done.}
 \item \textbf{New Hypotheses}\\
 $\Rightarrow$ back to (c); optionally back to step (b)
 \end{enumerate}
 \subsection{Interactive Method}
 \subsubsection{Visual Analytics} 
 \subsubsection{Agile Model Development}
 \subsubsection{Unbalanced Data Set} 
 \section{Data Preprocessing}
 In order to use the news articles for machine learning algorithms, we must first prepare and filter the texts appropriately:
 \begin{description}
 \item \textbf{Removing punctuation marks}\\
 We replace all punctuation marks with white spaces.
 \item \textbf{Tokenization}\\
 Every news article is split into a list of single words.
 \item \textbf{Leaving out numbers}\\
 We ignore all numbers in the news article.
 \item \textbf{Transforming words to lower case}\\
 Every word is transformed to lower case.
 \item \textbf{Word stemming}\\
 We reduce every word to its word stem (e.g. 'approves' to 'approv').
 \item \textbf{Ignoring stop words}\\
 We filter out extremely common words ('a', 'about', 'above', 'after', 'again', etc.) and other unwanted terms ('reuters', 'bloomberg', etc.).
 \end{description}
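The preprocessing steps listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function name \texttt{preprocess}, the shortened stop-word list and the pluggable \texttt{stem} argument are hypothetical. By default no stemming is performed; a real stemmer (for example the \texttt{stem} method of an nltk \texttt{PorterStemmer} instance) could be passed in instead. Stop words are removed before stemming here so that the stemmer cannot alter them; the list above does not prescribe this order.

```python
import string

# Tiny illustrative stop-word list; the list used in practice is longer.
STOP_WORDS = frozenset(
    {"a", "about", "above", "after", "again", "reuters", "bloomberg"}
)

# Translation table that maps every ASCII punctuation mark to a blank
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stem=lambda word: word, stop_words=STOP_WORDS):
    """Turn a raw article text into a list of cleaned tokens."""
    text = text.translate(_PUNCT_TO_SPACE)                # remove punctuation
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if not t.isdigit()]       # leave out numbers
    tokens = [t.lower() for t in tokens]                  # lower case
    tokens = [t for t in tokens if t not in stop_words]   # ignore stop words
    return [stem(t) for t in tokens]                      # word stemming
```

Because punctuation is replaced by white space first, a number such as 7.5 is split into two digit tokens, both of which are then dropped by the number filter.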
 \section{Model Selection}
 \subsection{Naive Bayes} 
 GaussianNB vs MultinomialNB
 \subsection{SVM} 
 \subsection{Decision Tree} 
 \section{Recognition of merger partners} 
 \subsubsection{Named Entity Recognition (NER)} 
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In this chapter we... In the next chapter...
 }
 % Chapter 5: Data Exploration
 %###########################
 \chapter{Data Exploration}
 \label{chap:exploration}
 \textit{
 In this chapter we explore our textual corpus of news articles.
 }
 \section{Text Corpus Exploration} 
 The textual corpus\footnote{We describe the initial data set in detail in Chapter \ref{chap:design}, Section \ref{subsec:data_selection}.} consists of the news articles' headlines and plain texts, if not specified otherwise. For the sake of simplicity, we use the unigram model for our analysis.
 \subsection{Sources for News Articles}
 As illustrated in Table \ref{table:sources}, the main source for news articles in our data set is \textit{Reuters.com}. This is because \textit{webhose.io} does not have equal access to all desired sources and, above all, because the news from Reuters.com in particular has been parsed with the required quality.
 \begin{center}
 \begin{table}[h]
 \centering
 \begin{tabular}{|l|r|}
 \hline
 reuters.com & 94\% \\ 
 theguardian.com & 3\% \\ 
 economist.com & 2\% \\ 
 bloomberg.com & $<$ 1\% \\ 
 cnn.com & $<$ 1\% \\ 
 ft.com & $<$ 1\% \\
 \hline
 \end{tabular}
 \caption{Article sources in the data set}
 \label{table:sources}
 \end{table}
 \end{center}
 \al{Is it a problem that Reuters is the main source?}
 \subsection{Number of Features}
 The document term matrix of the entire data set has 47,545 features.
 \subsection{Length of Articles}
 The average length of the news articles examined is 2476 characters\footnote{headlines excluded}. The distribution of the article length in the dataset is shown in Figure \ref{fig:article_length}.
 \begin{figure}[h]
 \centering
 \includegraphics[width=\textwidth]{images/art_length_200bins_best.png}
 \caption{Histogram of article lengths}
 \label{fig:article_length}
 \end{figure}
 \subsection{Most Common Words}
 The 10 most common words in the data set are: \textit{'percent', 'fitch', 'billion', 'new', 'business', 'market', 'next', 'million', 'ratings', 'investors'}.
 % TODO
 \al{This is based only on the first 100 articles in the data set and still has to be replaced.}
 \begin{figure}[h]
 \centering
 \includegraphics[width=\textwidth]{images/Hist_10CommonWords_100rows_2.png}
 \caption{Bar chart of 10 most frequent words in data set}
 \label{fig:10_most_common}
 \end{figure}
 % First a chart/word cloud of the entire corpus,
 % then only the articles about mergers.
 \begin{figure}[h]
 \centering
 \includegraphics[width=\textwidth]{images/WordCloud_allRows_best.png}
 \caption{WordCloud of most frequent words in data set}
 \label{fig:wordcloud}
 \end{figure}
 \subsection{Distribution of Company Names}
 % => TODO
 \al{This still refers to the old data set and has to be replaced.}
 'Comcast' is the most frequently used company name in the data set. Figure \ref{fig:company_names} shows that big companies dominate the reporting about mergers. In order to use a fairly distributed data set for model selection, we limit the number of articles used to 3 per company name.
 \begin{figure}[h]
 \centering
 \includegraphics[width=\textwidth]{images/NER_old_50bins.png}
 \caption{Histogram of Company Names Distribution}
 \label{fig:company_names}
 \end{figure}
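The cap of 3 articles per company name mentioned above can be sketched as follows. This is a hypothetical illustration: the function name \texttt{cap\_per\_company} and the field \texttt{company} are assumptions, and each article is assumed to have already been assigned exactly one company name (for example by a preceding NER step).

```python
from collections import Counter


def cap_per_company(articles, max_per_company=3):
    """Keep at most max_per_company articles for each company name."""
    seen = Counter()  # number of articles kept so far per company
    kept = []
    for article in articles:
        company = article["company"]  # assumed field, filled by an NER step
        if seen[company] < max_per_company:
            seen[company] += 1
            kept.append(article)
    return kept
```

The sketch keeps the first articles it encounters for each company; shuffling the input beforehand would turn this into a random selection.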
 \bigskip
 \paragraph{Summary:}
 \textit{\newline
 In this chapter we... In the next chapter...
 }
 % Chapter 6: Labeling
 %###########################
 \chapter{Data Labeling}
 \label{chap:labeling}
 \textit{
 This chapter describes and compares two data labeling processes: a conventional method and an interactive method.
 }
 \section{Conventional Method}
 \subsection{Data Set}
 First, we label a smaller data set in a conventional way. The data set consists of 1497 news articles, which were downloaded via \textit{webhose.io}. It contains news articles from different Reuters RSS feeds published within a period of one month\footnote{The time frame was May 25 to June 25, 2018; the data was retrieved on June 25, 2018.}. Here, we only keep articles that contain at least one of the keywords \textit{'merger', 'acquisition', 'take over', 'deal', 'transaction'} or \textit{'buy'} in the headline.
 With the following query\footnote{The possible filter settings are documented at \url{https://docs.webhose.io/docs/filters-reference}.} we download the desired data from \textit{webhose.io}:\\\\
 \texttt{
 thread.title:(merger OR merges OR merge OR merged 
    OR acquisition 
    OR "take over"   
    \noindent\hspace*{42mm}%
   OR "take-over" OR takeover 
    OR deal OR transaction OR buy) \\
 is\_first:true \\
 site\_type:news \\
 site:reuters.com \\
 language:english}
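The effect of the title filter in the query above can be approximated locally with a regular expression. This is only a sketch: the names \texttt{KEYWORDS} and \texttt{title\_matches} are hypothetical, and whole-word, case-insensitive matching is an assumption; the matching rules that \textit{webhose.io} applies internally may differ.

```python
import re

# Keywords taken from the thread.title filter of the query
KEYWORDS = [
    "merger", "merges", "merge", "merged", "acquisition",
    "take over", "take-over", "takeover", "deal", "transaction", "buy",
]

# One alternation of all keywords, matched as whole words, ignoring case
_KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def title_matches(title):
    """Return True if the headline contains at least one keyword."""
    return _KEYWORD_PATTERN.search(title) is not None
```

With whole-word matching, a headline containing 'buy' is accepted, whereas one containing only 'buyers' is not.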
 \subsection{Classification}
 The articles are classified into two classes with the following labels:
 \begin{description}
 \item[0:]{merger of company A and B}
 \item[1:]{other}
 \end{description}
 The process of reading and labeling the 1497 news articles takes about 30 hours in total, i.e. about 1.2 minutes per article on average.
 \subsection{Difficulties}
 Some article texts were difficult to classify even when read carefully.
 Here are a few examples of the difficulties that showed up:
 \begin{itemize}
 \item \textit{'Company A acquires more than 50\% of the shares of company B.'}\\ $\Rightarrow$ How should share sales be handled? Strictly speaking, this is a change of ownership, even if it is not a real merger.
 \item \textit{'Company X will buy/wants to buy company Y.'} \\ $\Rightarrow$ Will the merger definitely take place? On what circumstances does it depend?
 \item \textit{'Last year company X and company Y merged. Now company X wants to invest more in renewable energies.'}\\ $\Rightarrow$ The merger is only mentioned in passing and is not taking place right now. The main topic of the article is something completely different.
 \end{itemize}
 These difficulties led to the idea of using different labeling classes, which we finally implemented in the interactive labeling method.
 \section{Interactive Method}
 % Advantage: could yield better results, as it is more general / uses a larger amount of data
 \subsection{Data Set}
 For the interactive labeling method, we use the data set of 10,000 articles from a whole year described in Chapter \ref{chap:design}, Section \ref{sec:data_selection}.
 \subsection{Classification}
 For the multiclass classification we use the following six classes:
 \begin{description}
 \item[1:]{merger of company A and B}
 \item[2:]{merger is pending}
 \item[3:]{merger is aborted}
 \item[4:]{sale of shares}
 \item[5:]{merger as incidental remark, not main topic}
 \item[6:]{other / irrelevant news}
 \end{description}
 \subsection{Selection of Articles} 
 \subsection{Procedure}
 % Randomly select 10 articles from each month.
 % It is likely that we then only get mergers that are covered by many articles => this could be minimized by stratified sampling => first run NER, then randomize fairly across classes => select 10 articles from 100 categories => select 10 categories => pick one random article from each. Label 1\% of all articles.
 % 1) Build first models, e.g. Bayes. Apply them to all articles => vector of probabilities per class: (K1, K2, … , K6)
 % Clear cases: Kx > 80\% and all other Ky < 10\% (with x in {1-6}, y != x)
 % => adopt the label => how many cases are unambiguous?
 % Claim: 10\% of all articles are unambiguous
 % Spot check => randomly select 10 articles from each class
 % Identification of extremely unclear cases
 % More than one class has a similar probability
 %(5\%, 5\%, 5\%, …) => (80\%, 80\%, 0\%, 0\%, …)
 % e.g. look at 100 articles and label them manually
 % => repeat this 3-4 times, going back to step 1) (build model)
 % => 95\% of all cases are now clear.
 % => why do the remaining 5\% not work? Inspect a sample of these articles
 % If that does not work, improve the models or the preprocessing (e.g. NER)
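The acceptance rule sketched in the notes for this procedure (adopt the predicted label only if one class probability exceeds 80\% and all other class probabilities stay below 10\%) could look as follows in Python. The function name \texttt{clear\_label} and its parameters are hypothetical; the thresholds are the draft values from the notes, not tuned results.

```python
def clear_label(class_probs, accept=0.80, reject=0.10):
    """Return the class number (1-based) of a clear case, else None.

    class_probs is the vector (K1, ..., K6) of estimated class
    probabilities for a single article.
    """
    # Index of the most probable class
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    others = [p for i, p in enumerate(class_probs) if i != best]
    if class_probs[best] > accept and all(p < reject for p in others):
        return best + 1  # classes are numbered 1..6
    return None  # unclear case: leave for manual labeling
```

Articles for which the function returns \texttt{None} remain candidates for manual labeling in the next iteration.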
 \subsection{Tagging of Named Entities} 
 Histogram: x-axis: authors/persons, companies; y-axis: number of mentions
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In this chapter...in the next chapter...
 }
 % Chapter 7: Implementation
 %##########################
 \chapter{Implementation} 
 \label{chap:implementation}
 \textit{
 This chapter deals with the most relevant parts of the implementation.
 }
 \section{Data Download}
 \label{sec:data_download}
 \section{Python Modules} 
 \subsection{nltk} 
 \subsection{pandas} 
 \subsection{sklearn}
 \subsection{webhose.io}
 \section{Jupyter Notebook}
 For interactive coding, labeling, visualization and documentation.
 \section{Own Implementation}
 \subsection{Examples} 
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In this chapter, we...In the next chapter...
 }
 % Chapter 8: Evaluation
 %##########################
 \chapter{Evaluation} 
 \label{chap:evaluation}
 \textit{
 In this chapter we evaluate the different machine learning methods.
 }
 \section{Model Fitting}
 % Optimize the algorithms for recall instead of F1, or report both
 % Remember: vary individual hyperparameters SEPARATELY
 % Variant: applying the model only to 'Title' gives even better results!
 % Put all metrics, robustness, over-/underfitting etc. into an overview table!
 % Vary: SelectPercentile (1, 5, 25, 75, 100), BOW/CountVectorizer, preprocessing (stop words, stemming, ...), hyperparameters (alpha, gamma=0.0001, C, ...), with/without text => document everything
 \subsection{Naive Bayes Model}
 Multinomial Naive Bayes
 Grid-Search
 \subsection{SVM}
 % Try 5-fold cross-validation
 % Best SVM result with the OLD data set:
 % best score: 0.876
 % best parameters set found on development set:
 % C: 0.1, gamma: 0.0001, kernel: linear, percentile: 50
 \subsection{Decision Tree}
 % Output the 20 most important features!
 % Used a simple train/test split (0.25) on 'Title' only in the old data set:
 \al{This is still from the old data set and has to be replaced.}
 20 most important words in testing set:
 ['merger', 'buy', 'monsanto', 'warner', 'win', 'walmart', '2', 'billion', 'kkr', 'rival', 'uk', 'watch', 'jv', 'merg', 'get', 'non', 'anz', 'xerox', 'clear', 'deal']
 \section{Recognition of Merger Partners}
 % The Stanford variant achieves fairly good results.
 \section{Performance}
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In this chapter we have described ... In the last chapter we describe...
 }
 % Chapter: Summary
 %#############################
 \chapter{Summary} 
 \label{chap:summary}
 \section{Comparison of Labeling Methods}
 \section{Quality of Predictions} 
 \section{Conclusions} 
 \section{Future Work} 
 \subsubsection{}
 The task of this work could also be solved by using an artificial neural network (ANN). % A more detailed explanation is still missing.
 This may lead to even better results.
 \bigskip
 \paragraph{Summary:} 
 \textit{\newline
 In the last chapter we have described ....
 }
 % not as a chapter:
 \nocite{*}
 % List of figures
 \addcontentsline{toc}{chapter}{List of Figures}
 \listoffigures
 % Bibliography
 \addcontentsline{toc}{chapter}{Bibliography}
 \bibliographystyle{ieeetr}  \bibliography{refs}
 \backmatter 
 \thispagestyle{empty}
 \vspace*{\fill}
 \pagestyle{empty}
 {\normalsize
 \begin{center}\textbf{Eidesstattliche Erklärung}\end{center}
 Hiermit versichere ich an Eides statt, dass ich die vorliegende Arbeit im Bachelorstudiengang Wirtschaftsinformatik selbstständig verfasst und keine anderen als die angegebenen Hilfsmittel – insbesondere keine im Quellenverzeichnis nicht benannten Internet-Quellen – benutzt habe. Alle Stellen, die wörtlich oder sinngemäß aus Veröffentlichungen entnommen wurden, sind als solche kenntlich gemacht. Ich versichere weiterhin, dass ich die Arbeit vorher nicht in einem anderen Prüfungsverfahren eingereicht habe und die eingereichte schriftliche Fassung der auf dem elektronischen Speichermedium entspricht.
 \vspace*{1cm}\\
 Hamburg, den 01.03.2019
 \hspace*{\fill}\begin{tabular}{@{}l@{}}\hline
 \makebox[5cm]{Anne Lorenz}
 \end{tabular}
 \vspace*{3cm}
 % This is optional; delete if not needed!
 \begin{center}\textbf{Veröffentlichung}\end{center}
 Ich stimme der Einstellung der Arbeit in die Bibliothek des Fachbereichs Informatik zu.
 \vspace*{1cm}\\
 Hamburg, den 01.03.2019
 \hspace*{\fill}\begin{tabular}{@{}l@{}}\hline
 \makebox[5cm]{Anne Lorenz}
 \end{tabular}
 }
 \vspace*{\fill} 
 \end{document}
--- a/visualization/Interactive_Labeling_Diagramm.png
+++ b/visualization/Interactive_Labeling_Diagramm.png