{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Jupyter Notebook for Interactive Labeling\n",
"______\n",
"\n",
"This Jupyter Notebook combines a manual and automated labeling technique.\n",
"By calculating estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.\n",
"For multiclass labeling, 3 classes are used.\n",
"\n",
"In each iteration we...\n",
"- check/correct the next article labels manually.\n",
" \n",
"- apply the SVM classification algorithm, which returns a vector class_probs $(K_1, K_2, K_3)$ per sample with the estimated probability $K_i$ for each class $i$. An estimated class label is checked manually if its probability $K_x < t$, where $x \\in \\{1, 2, 3\\}$ is the estimated class and $t$ is the threshold.\n",
" \n",
"Please note: User instructions are written in upper-case.\n",
"__________\n",
"Version: 2019-04-02, Anne Lorenz"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import operator\n",
"import pickle\n",
"import random\n",
"\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"import ipywidgets as widgets\n",
"from IPython.core.interactiveshell import InteractiveShell\n",
"from IPython.display import display\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_selection import SelectPercentile\n",
"from sklearn.metrics import recall_score, precision_score, f1_score, make_scorer\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.semi_supervised import label_propagation\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"from BagOfWords import BagOfWords\n",
"from MNBInteractive import MNBInteractive\n",
"from SVMInteractive import SVMInteractive\n",
"from SVMInteractive_wp import SVMInteractive_wp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part I: Data preparation\n",
"\n",
"First, we import our data set of 10 000 business news articles from a CSV file.\n",
"It contains 833 or 834 articles for each month of 2017.\n",
"For detailed information regarding the data set, please read the full documentation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# initialize random => reproducible sequence\n",
"random.seed(5)\n",
"random_state = 5\n",
"\n",
"filepath = '../data/cleaned_data_set_without_header.csv'\n",
"\n",
"# set up wider display area\n",
"pd.set_option('display.max_colwidth', None)  # None = unlimited; -1 is rejected by recent pandas versions\n",
"\n",
"# display the value of every expression in a cell, not only the last one\n",
"InteractiveShell.ast_node_interactivity = \"all\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
"In the following, we limit the number of occurrences of a certain company name in all labeled articles to 3 to avoid imbalance."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def show_next(index):\n",
"    '''Display an article's text and an interactive slider to set its label manually.\n",
"    '''\n",
"    print('News article no. {}:'.format(index))\n",
"    print()\n",
"    print('HEADLINE:')\n",
"    print(df.loc[df['Index'] == index, 'Title'])\n",
"    print()\n",
"    print('TEXT:')\n",
"    print(df.loc[df['Index'] == index, 'Text'])\n",
"\n",
"    def f(x):\n",
"        # save user input (label and current round number m)\n",
"        df.loc[df['Index'] == index, 'Label'] = x\n",
"        df.loc[df['Index'] == index, 'Round'] = m\n",
"\n",
"    # create slider widget for labels, preset to the estimated label\n",
"    estimated = int(df.loc[df['Index'] == index, 'Estimated'].iloc[0])\n",
"    interact(f, x=widgets.IntSlider(min=-1, max=2, step=1, value=estimated))\n",
"    print('0: Other/Unrelated news, 1: Merger,')\n",
"    print('2: Topics related to deals, investments and mergers')\n",
"    print('___________________________________________________________________________________________________________')\n",
"    print()\n",
"    print()\n",
"\n",
"# list of article indices that will be shown next\n",
"label_next = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The iteration part starts here:\n",
"\n",
"## Part II: Manual checking of estimated labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE SET THE ROUND NUMBER M MANUALLY IF THE PROCESS WAS INTERRUPTED BEFORE."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"m=16"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This round number: 16\n",
"Number of manually labeled articles: 1132\n",
"Number of manually unlabeled articles: 8868\n"
]
}
],
"source": [
"# read current data set from csv\n",
"df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
" sep='|',\n",
" usecols=range(1,13), # drop first column 'unnamed'\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')\n",
"\n",
"# find current iteration/round number\n",
"m = int(df['Round'].max())\n",
"print('This round number: {}'.format(m))\n",
"print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1082"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building a balanced training data set by sampling the same number of articles from each class:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50\n"
]
}
],
"source": [
"labeled_pos_0 = df.loc[df['Label'] == 0].reset_index(drop=True)\n",
"labeled_pos_1 = df.loc[df['Label'] == 1].reset_index(drop=True)\n",
"labeled_pos_2 = df.loc[df['Label'] == 2].reset_index(drop=True)\n",
"\n",
"max_sample = min(len(labeled_pos_0), len(labeled_pos_1), len(labeled_pos_2))\n",
"print(max_sample)\n",
"\n",
"sampling_class0 = labeled_pos_0.sample(n=max_sample, random_state=random_state)\n",
"sampling_class1 = labeled_pos_1.sample(n=max_sample, random_state=random_state)\n",
"sampling_class2 = labeled_pos_2.sample(n=max_sample, random_state=random_state)\n",
"\n",
"training_data_0 = pd.concat([sampling_class0, sampling_class1, sampling_class2])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testing_data_1 = df.loc[(df['Label'] == -1)]\n",
"len(testing_data_1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now check (and correct if necessary) the next auto-labeled articles."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Algorithmic Classification: ##"
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {},
"outputs": [],
"source": [
"# Idea: use the classes predicted by SVMInteractive_wp for 'Estimated', but then use the LinearSVC probabilities for the selection!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# SVM: starting interactive SVM...\n",
"\n",
"# ending SVM\n",
"Wall time: 4.41 s\n"
]
}
],
"source": [
"# call script with manually labeled and manually unlabeled samples\n",
"#%time classes, class_probs = SVMInteractive.estimate_svm(training_data_0, testing_data_1)\n",
"%time classes, predictions_test = SVMInteractive_wp.estimate_svm(training_data_0, testing_data_1)\n",
"#%time classes, class_count, class_probs = MNBInteractive.estimate_mnb(training_data_0, testing_data_1)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.20873577, 0.45865831, 0.33260592],\n",
" [0.24515187, 0.432961 , 0.32188713],\n",
" [0.26248741, 0.42135557, 0.31615703],\n",
" [0.24401247, 0.43002251, 0.32596502],\n",
" [0.32956613, 0.37679369, 0.29364018],\n",
" [0.29423836, 0.39891716, 0.30684447],\n",
" [0.41115 , 0.32306568, 0.26578432],\n",
" [0.68406507, 0.16496631, 0.15096861],\n",
" [0.30203702, 0.39060799, 0.307355 ],\n",
" [0.53367057, 0.23765576, 0.22867367]])"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class_probs[:10]"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"# annotate highest estimated probability for every instance\n",
"maxima = []\n",
"for row in class_probs:\n",
" maxima.append(np.amax(row))"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8888"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"8888"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(maxima)\n",
"len(class_probs)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"# save the highest estimated probability of each article (list 'maxima')\n",
"with open('../obj/'+ 'array_class_probs_round_{}_svm'.format(m) + '.pkl', 'wb') as f:\n",
" pickle.dump(maxima, f, pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 0, 'Highest estimated probability')"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Fraction of articles with this highest estimated probability')"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 576x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# sort list in descending order\n",
"maxima.sort(reverse=True)\n",
"\n",
"# convert list to array\n",
"probas = np.asarray(maxima)\n",
"\n",
"n_bins = 50\n",
"\n",
"fig, ax = plt.subplots(figsize=(8, 4))\n",
"\n",
"# plot the cumulative histogram\n",
"n, bins, patches = ax.hist(probas, n_bins, density=1, histtype='step',\n",
" cumulative=True, facecolor='darkred')\n",
"\n",
"ax.grid(True)\n",
"ax.set_xlabel('Highest estimated probability')\n",
"ax.set_ylabel('Fraction of articles with this highest estimated probability')\n",
"#plt.axis([0.49, 1, 0, 0.015])\n",
"#ax.set_xbound(lower=0.5, upper=0.99)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 0 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
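"# save the figure object created above; calling plt.savefig() after plt.show() would write an empty figure\n",
"fig.savefig('../visualization/probabilities_after_round_{}_svm.png'.format(m))"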
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We annotate each article's estimated class with its probability in columns 'Estimated' and 'Probability':"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# list of indices of the recently estimated articles\n",
"indices_estimated = df.loc[df['Label'] == -1, 'Index'].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8878"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(indices_estimated)"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [],
"source": [
"for n, row in enumerate(class_probs):\n",
"    index = indices_estimated[n]\n",
"    # position of the highest estimated probability\n",
"    i = int(np.argmax(row))\n",
"    # save estimated label\n",
"    df.loc[index, 'Estimated'] = classes[i]\n",
"    # annotate probability\n",
"    df.loc[index, 'Probability'] = row[i]"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2., 2., 0., 1., 0., 0., 0., 0., 2., 0.])"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#predictions_test[:10]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# without probabilities:\n",
"n = 0 \n",
"for row in predictions_test:\n",
" index = indices_estimated[n]\n",
" # save estimated label\n",
" df.loc[index, 'Estimated'] = row\n",
" # no probability estimate is available in this variant\n",
" df.loc[index, 'Probability'] = np.nan\n",
" n += 1"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8878"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimated labels in total: class 0: 5858, class 1: 1143, class 2: 1877\n"
]
}
],
"source": [
"print('Estimated labels in total: class 0: {}, class 1: {}, class 2: {}'\n",
" .format(len(df.loc[(df['Label'] == -1) & (df['Estimated'] == 0.0)]), \n",
" len(df.loc[(df['Label'] == -1) & (df['Estimated'] == 1.0)]),\n",
" len(df.loc[(df['Label'] == -1) & (df['Estimated'] == 2.0)])))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# save round\n",
"df.to_csv('../data/interactive_labeling_round_{}_svm_without_proba.csv'.format(m),\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manual Labeling: ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find new threshold for labeling:"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of articles with estimated probability < 0.354: 10\n"
]
}
],
"source": [
"threshold = 0.354 # mnb 0.582 #0.3429 # mnb: 0.62 # svm: 0.3511\n",
"\n",
"n = 0\n",
"for ma in maxima:\n",
"    if ma < threshold:\n",
"        n += 1\n",
"n\n",
"print('Number of articles with estimated probability < {}: {}'.format(threshold, len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold)])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check articles with probability under threshold:"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [],
"source": [
"# pick articles with P < t:\n",
"label_next = df.loc[(df['Label'] == -1) & (df['Probability'] < threshold), 'Index'].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# pick articles randomly:\n",
"num_rand = 10\n",
"label_next = df[(df['Label'] == -1)].sample(n=num_rand, random_state=random_state)\n",
"label_next = label_next['Index'].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(label_next)"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimated labels to be checked: class 0: 3, class 1: 4, class 2: 3\n"
]
}
],
"source": [
"print('Estimated labels to be checked: class 0: {}, class 1: {}, class 2: {}'\n",
" .format(len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 0.0)]), \n",
" len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 1.0)]),\n",
" len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 2.0)])))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"15"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This round number: 16\n"
]
}
],
"source": [
"# increment round number\n",
"m += 1\n",
"print('This round number: {}'.format(m))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"News article no. 8297.0:\n",
"\n",
"HEADLINE:\n",
"8297 Exclusive: Airbus defense unit freezes capex, may miss cash goals - memo\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"8297 October 5, 2017 / 8:51 AM / in 3 hours Exclusive: Airbus defense unit freezes capex, may miss cash goals - memo Tim Hepher 4 Min Read FILE PHOTO - An aerial view of an Airbus A400M aircraft during the 52nd Paris Air Show at Le Bourget Airport near Paris, France, June 21, 2017. Picture taken June 21, 2017. REUTERS/Pascal Rossignol/File Photo PARIS (Reuters) - Airbus Defence and Space has frozen capital spending and urged its 34,000 staff to take drastic measures to save cash as it faces the prospect of missing 2017 cash targets by hundreds of millions of euros, according to a memo seen by Reuters. With the risk of missing our full-year cash targets by hundreds of millions, we need to do something extraordinary together, divisional finance chief Julian Whitehead told an internal forum, according to a summary distributed to staff. Airbus ( AIR.PA ) has said it expects 2017 group-wide free cashflow to be similar to 2016, before mergers and acquisitions and customer financing. It does not publish cash targets for divisions. Due to the bumpy patterns of cashflows in aerospace, it often faces a dash to meet targets in the fourth quarter. Airbus Defence & Space, which has warned of continued cash pressures from the troubled A400M military aircraft program, plans to set up a Cash Crisis team to improve the situation by end-year, with all its programs expected to participate. Until those plans become clear, all capital expenditure is being frozen with immediate effect across all the divisions activities and across all its subsidiaries, the memo said. Airbus shares stumbled from record highs and fell as much as 1.6 percent. They were down 1.3 percent by 0945 GMT, making the stock the worst performer on France's benchmark CAC-40 .FCHI equity index. Airbus stock price remains up around 30 percent since the start of 2017 on buoyant demand for passenger jets, although rival Boeings ( BA.N ) shares are up 64 percent. 
Asked to comment on the memo, an Airbus spokesman said: We are currently in the traditional year-end race in the commercial and government business. FILE PHOTO: An Airbus A400M aircraft flies during a display on the first day of the 52nd Paris Air Show at Le Bourget airport near Paris, France, June 19, 2017. Picture taken June 19, 2017. REUTERS/Pascal Rossignol/File Photo He added: It is key to remind our troops at this important time of a business year on the importance of meeting our cash objectives. Thats the current ongoing effort at Airbus and it is rather standard procedure to achieve our quarterly and yearly divisional targets at Airbus Defence and Space without deviation. ALSO FACING PRESSURE FROM A400M DELAYS Another person close to the group said the language used in the memo was typical of the purely internal battle cry used by managers at this time of year to focus on reaching targets. On Wednesday, however, Airbus reminded European governments that the delayed A400M would continue to weigh significantly on cashflow in 2017 and 2018, especially. It has been squeezed as Germany withholds some 15 percent in cash owed for the transport plane because of what it regards as systems failing to do what Airbus had promised. The company earlier this year entered talks with buyer nations to try to ease the penalties and get a new agreement on schedules. At a group level, cash and profits have further been hampered by delays in delivering A320neo jetliners because of delays in receiving engines from U.S. supplier Pratt & Whitney. Late deliveries delay payments from airlines and prevent workers learning through experience as quickly as planned, which drives up cost and eats up cash for inventory on assembly lines. Airbus as a whole had 7.9 billion euros of net cash at end-June, down from 11.1 billion at the end of 2016. 
While freezing spending, Airbus Defence & Space is also in the midst of a strategy overhaul that has involved selling its electronics activity and now puts faith in the growth of digital services to help it grow faster than the rest of the industry. Reporting by Tim Hepher; Editing by Sudip Kar-Gupta \n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "72070fc55d8b4a27bc3e42eea356ed8d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 482.0:\n",
"\n",
"HEADLINE:\n",
"482 BM&FBovespa says CME fully divests from bourse shares\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"482 Big Story 10 25am EST BM&FBovespa says CME fully divests from bourse shares BRASILIA BM&FBovespa SA, Latin America's largest financial bourse, said on Friday that CME Group had fully divested its position in shares issued by the Brazilian bourse, but said the accords between both companies remained valid. \"The agreements between BM&FBovespa and the CME Group will remain valid and the companies will seek to continue to cooperate strategically in developing products, technology and other areas of mutual interest for both companies,\" BM&FBovespa said in a filing. (Reporting by Paula Arend Laier; Writing by Alonso Soto) Next In Big Story 10 Pakistan's Sindh province cracks down on child labor KARACHI, Pakistan (Thomson Reuters Foundation) - Pakistan's Sindh province has banned children under 14 from working, becoming the third region to limit child labor in a country where millions of minors work in sectors from brick making to carpet weaving, farming to mining.\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4e5e436f23e7437bafe92a85507c6fdd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 8248.0:\n",
"\n",
"HEADLINE:\n",
"8248 AstraZeneca plans new pivotal lung cancer trial with Incyte\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"8248 October 31, 2017 / 8:39 AM / Updated 7 hours ago AstraZeneca plans new pivotal lung cancer trial with Incyte Reuters Staff 2 Min Read LONDON (Reuters) - AstraZeneca ( AZN.L ) is stepping up its bet on immunotherapy combination treatments to fight lung cancer by signing a deal with Incyte ( INCY.O ) under which the two companies will start a final-stage Phase III clinical trial next year. FILE PHOTO - The logo of AstraZeneca is seen on medication packages in a pharmacy in London, Britain April 28, 2014. REUTERS/Stefan Wermuth The study will test AstraZenecas Imfinzi alongside Incytes second-generation immunotherapy drug epacadostat, a so-called IDO inhibitor that also helps the immune system fight cancer. Excitement has been building about epacadostat on the back of recent promising clinical data, and U.S.-based Incyte already has agreements with Merck & Co ( MRK.N ) and Bristol-Myers Squibb ( BMY.N ) for separate Phase III trials. Incytes decision to partner with multiple big drugmakers in this way has led Bernstein analyst Tim Anderson to describe it as a promiscuous company. In the case of the deal with AstraZeneca, the combination of Imfinzi and epacadostat will be tested in patients with relatively early, or stage III, lung cancer. As such, it builds on the success of Imfinzi alone in this setting. The Phase III trial will be co-funded by the two companies and conducted by AstraZeneca. It is expected to begin enrolling patients in the first half of 2018, the two groups said on Tuesday. Epacadostat works by blocking an enzyme that protects tumours from the immune system, while Imfinzi is one of five approved drugs known as PD-L1 or PD-1 inhibitors that block a different mechanism that cancer cells use to evade detection. Experts think the two could work well together without the added toxicity seen with other combinations because Imfinzi is systemic, while epacadostat works specifically at the tumour site. 
Reporting by Ben Hirschler, editing by Louise Heavens\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "48f189fe7aa44d06a864b3f34f01bd6c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 3043.0:\n",
"\n",
"HEADLINE:\n",
"3043 German minister, labour reps welcome PSA work contract assurances for Opel merger\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"3043 Deals - Wed Apr 5, 2017 - 1:06pm BST German minister, labor reps welcome PSA work contract assurances for Opel merger German Economy Minister Brigitte Zypries meets Chairman of the Managing Board of French carmaker PSA Group Carlos Tavares in Berlin, Germany, April 5, 2017. REUTERS/Fabrizio Bensch BERLIN Germany's economy minister said she had held constructive talks with PSA Chairman Carlos Tavares on Wednesday about the planned merger of the French group with Germany's Opel and felt reassured that existing labor deals would remain. Germany has welcomed the merger, provided the Opel brand stays independent and the merged group respects existing labor agreements, protects Opel sites and gives job guarantees. \"I particularly welcome the commitment by Mr Tavares to respect and continue all the collective agreements,\" said minister Brigitte Zypries in a statement. \"The federal government and federal states will continue to lend their constructive support to the process of merging PSA and Opel/Vauxhall,\" she added. Tavares said he had reaffirmed PSA's ambition to \"build on the quality of relations with employee representatives as a key factor of success of the company\". (Reporting by Madeline Chambers; Editing by Michelle Martin) Next In Deals Toshiba's Westinghouse fired chairman two days before bankruptcy filing TOKYO Westinghouse Electric Co LLC fired its chairman two days before the U.S. nuclear engineering unit of Toshiba Corp filed for bankruptcy last week, as the Japanese firm tries to draw a line under the travails of a business that has cost it billions. JAB Holding to buy bakery chain Panera Bread in $7.5 billion deal JAB Holdings, the owner of Caribou Coffee and Peet's Coffee & Tea, said on Wednesday it would buy U.S. bakery chain Panera Bread Co in a deal valued at about $7.5 billion, including debt, as it expands its coffee and breakfast empire. 
SYDNEY Blackstone Group has put an A$3.5 billion ($2.65 billion) shopping mall portfolio in Australia up for sale, said a source familiar with the matter, in what could be one of the country's largest ever real estate transactions. MORE FROM REUTERS From Around the Web Promoted by Revcontent Trending Stories\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0f948aa98c76418fb52c7c4ee31e7b77",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 9063.0:\n",
"\n",
"HEADLINE:\n",
"9063 UK economy peps up, bolstering BoE rate hike call - PMI\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"9063 November 3, 2017 / 11:57 AM / Updated 3 hours ago UK economy peps up, bolstering BoE rate hike call - PMI Andy Bruce , David Milliken 4 Min Read LONDON (Reuters) - Britains economy appears to be picking up speed, according to a survey on Friday that will reassure the Bank of England a day after it raised interest rates for the first time a decade. A man looks from a building in the financial district of Canary Wharf in London, Britain November 3, 2017. REUTERS/Kevin Coombs Sterling hit a days high against the dollar after the IHS Markit/CIPS services Purchasing Managers Index (PMI) jumped to 55.6 in October from 53.6 in September, its biggest one-month rise since August 2016. Despite nervousness among businesses about Brexit, the reading was its highest since April and exceeded all forecasts in a Reuters poll of economists. The survey of services businesses, which account for around 80 percent of British economic output, follows relatively upbeat PMI readings this week for the smaller manufacturing and construction sectors. Taken together they suggest the economy is growing at a quarterly rate of 0.5 percent, IHS Markit said, picking up from growth of 0.4 percent in the three months to September. Britains economy has lagged behind others in Europe and beyond this year as sterlings plunge following last years vote to leave the European Union pushes up inflation and uncertainty over the shape of Brexit causes businesses invest more slowly. The UK PMI may be starting to show some convergence with its firm global counterpart, JPMorgan economist Allan Monks said. Growth in the services sector outpaced that in the euro zone, as measured by a flash estimate, for the first time since January, the PMI showed. IHS Markit will publish a final estimate for the euro zone on Monday. The Bank of England will likely see Octobers (PMIs) as supportive to the decision to raise interest rates, said Howard Archer, chief economic adviser to the EY ITEM Club consultancy. 
Many private economists had warned before Thursdays decision by the BoE that a rate hike would be premature. However, serious uncertainties over the outlook evident among services companies fuels suspicion that it is likely to be some considerable time before the Bank of England hikes interest rates again, Archer said. The BoE raised rates for the first time in more than 10 years on Thursday and said its next increases would be very gradual. Deputy Governor Ben Broadbent said on Friday that the BoEs signal that it may need to raise interest rates two more times to bring down inflation was not a promise. Businesses are unsure about the outlook, and optimism among services companies remained well below its long-run average, fuelled mainly by uncertainty over Brexit, the PMI data showed. A deeper dive into the numbers highlights the fragility of the economy, said Chris Williamson, chief business economist at IHS Markit, which compiles the PMIs. BoE Governor Mark Carney said on Thursday that the central banks next move would be heavily influenced by the progress of talks on Britains departure from the EU. Growth could get a boost if a transitional deal gave businesses confidence to invest. But a failure to reach a deal would further weaken the pound and intensify inflation pressure. The services PMI, which covers non-retail businesses, said firms were putting up prices at the fastest rate since April. Costs increased rapidly, though at the slowest rate in just over a year, possibly tallying with the BoEs view that the inflationary effect of last years more than 10 percent fall in the value of the pound is starting to fade. Across the economy as a whole, the PMI showed that job creation was at its weakest since March. Squeezed margins and concerns about the economic outlook had led to more cautious hiring strategies, IHS Markit said. Editing by William Schomberg and Catherine Evans \n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8c0cdc44080b428eb10c7c75c1f8c0b1",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 1014.0:\n",
"\n",
"HEADLINE:\n",
"1014 Oil exports from southern Iraq down in January after OPEC deal - sources\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"1014 Business News - Tue Feb 7, 2017 - 2:28pm GMT Oil exports from southern Iraq down in January after OPEC deal - sources FILE PHOTO - A flag with the Organization of the Petroleum Exporting Countries (OPEC) logo is seen before a news conference at OPEC's headquarters in Vienna, Austria, December 10, 2016. REUTERS/Heinz-Peter Bader/File Photo BASRA, Iraq Crude oil exports from southern Iraq in January fell to 3.275 million barrels per day (bpd) from 3.51 million bpd in December, as the country complied with an agreement with other producers to reduce output, two oil executives said on Tuesday. December's exports from the southern region, where Iraq produces most of its oil, set a record high. Iraq is OPEC's second-largest crude producer after Saudi Arabia. The group agreed in late November to cut production in order to support sagging oil prices. (Reporting by Aref Mohammed; writing by Maher Chmaytelli; editing by Jason Neely) Next In Business News\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "91570e899edf45828876b3c3dd06688e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 9686.0:\n",
"\n",
"HEADLINE:\n",
"9686 China Guangfa Bank fined $109 million for rule violation - regulator\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"9686 December 8, 2017 / 10:02 AM / in 15 minutes China Guangfa Bank fined $109 million for rule violation - regulator Reuters Staff 1 Min Read BEIJING, Dec 8 (Reuters) - Chinas banking regulator has fined China Guangfa Bank Co 722 million yuan ($109.12 million) for providing illegal guarantees for defaulted corporate bonds, it said on Friday. The high-yielding bonds were issued by southern Chinese phone maker Cosun Group and sold through an Alibaba Group Holding Ltd-backed online wealth management platform. ($1 = 6.6165 Chinese yuan renminbi) (Reporting By Shu Zhang and Beijing moniterding desk)\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9cb3a6c65de145f9bdff2ec4b8170610",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 7898.0:\n",
"\n",
"HEADLINE:\n",
"7898 UPDATE 1-U.S. FDA panel backs approval of Novo Nordisk diabetes drug\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"7898 41 PM / Updated 8 minutes ago UPDATE 1-U.S. FDA panel backs approval of Novo Nordisk diabetes drug Reuters Staff (Adds analysts comment, background) By Toni Clarke WASHINGTON, Oct 18 (Reuters) - Novo Nordisk A/Ss new diabetes drug semaglutide is effective, reasonably safe and should be approved, an advisory panel to the U.S. Food and Drug Administration concluded on Wednesday. The panel voted 16-0 with one abstention in favor of the drug being approved. It would compete with others in a class known as glucagon-like peptide-1 (GLP-1) analogs, which imitate an intestinal hormone that stimulates the production of insulin. The FDA typically follows the recommendations of its advisors. Novo Nordisk is hoping that semaglutide, a once-weekly injection, will take market share from Eli Lilly & Cos once-weekly Trulicity, which in turn has been taking share from Novo Nordisks once-daily Victoza. Novo Nordisk is also developing an oral form of semaglutide. We believe semaglutide will be a formidable competitor for Lillys Trulicity, Alex Arfaei, an analyst at BMO Capital Markets, said in a research note. Analysts on average expect annual semaglutide sales to reach $3.17 billion by 2023, with sales of Trulicity, which was approved in the United States in late-2014, rising to $3.71 in 2023, according to Thomson Reuters data. Panelists discussed data showing that semaglutide was associated with an initial worsening of diabetic retinopathy, a condition caused by damage to blood vessels in the retina due to high blood sugar levels. The damage can cause progressive deterioration in vision, potentially leading to blindness. But they found that the benefit of reducing blood sugar overall offset this risk, which the company argues is transient. Analysts expect the drugs label to carry a standard warning, similar to insulins, regarding diabetic retinopathy. The FDA is scheduled to make its decision on semaglutide by Dec. 5th. 
(Reporting by Toni Clarke; Editing by Sandra Maler) 0 : 0\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5b0b75ccbcf94b9cb8a805f1d07bf1d8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 5535.0:\n",
"\n",
"HEADLINE:\n",
"5535 Gold steady as cloudy U.S. rate hike outlook drags on dollar\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"5535 July 18, 2017 / 1:00 AM / 3 hours ago Gold at 2-week high as cloudy U.S. rate hike outlook drags on dollar 3 Min Read An ounce of gold coin is pictured next to a 250g and a 500g ingots at Jolliet numismatic shop in Geneva November 19, 2014. Denis Balibouse BENGALURU (Reuters) - Gold prices rose to a two-week high on Tuesday as the dollar dipped to multi-month lows amid fading prospects of further rate hikes by the U.S. Federal Reserve this year and doubts whether President Donald Trump would be able to push through healthcare reforms. The U.S. dollar sank to a 10-month low against a basket of major currencies on Tuesday, hobbled by uncertainty over the pace of the Fed's policy tightening and setbacks to the passage of a U.S. healthcare bill. \"With the street repricing its U.S. interest rate outlook following soft data and a dovish Yellen, and with President Donald Trump's reflationary reforms seemingly lost in the legislative Bermuda Triangle of Congress, a weaker U.S. dollar should continue to support gold,\" said Jeffrey Halley, a senior market analyst at OANDA. Republicans in the U.S. Congress were in chaos over healthcare legislation after a second attempt to pass a bill in the Senate collapsed late on Monday, with President Donald Trump calling for an outright repeal of Obamacare and others seeking a change in direction toward bipartisanship. Spot gold was up 0.3 percent to $1,237.66 per ounce at 0631 GMT, after touching $1,238.76, the highest since July 3, earlier in the session. U.S. gold futures for August delivery rose 0.3 percent to $1,236.80 per ounce. \"At this moment, gold is likely to be in the trading range of $1,200-1,250,\" said Mark To, head of research at Hong Kong's Wing Fung Financial Group. Prices of the metal are unlikely to significantly break above these levels since there are no other major drivers, including geopolitical factors, for gold as of now, he added. 
Spot gold faces a resistance at $1,239 per ounce, and may temporarily hover below this level or retrace towards a support at $1,226, according to Reuters technical analyst Wang Tao. Meanwhile, SPDR Gold Trust, the world's largest gold-backed exchange-traded fund, said its holdings fell 0.21 percent to 827.07 tonnes on Monday from 828.84 tonnes on Friday. In other precious metals, silver rose 0.4 percent to $16.14 per ounce, after earlier touching its highest in just over two weeks at $16.23. Platinum rose 0.3 percent to $923.80 per ounce. It had touched an over one-month high of $934.40 in the previous session. Palladium was mostly unchanged at $864.98 per ounce. Reporting by Arpan Varghese in Bengaluru; Editing by Sunil Nair 0 : 0 \n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f59b298ae7c144519c3c1d2cb86b0ae0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n",
"News article no. 966.0:\n",
"\n",
"HEADLINE:\n",
"966 UPDATE 1-Brazil's Vale produced record 349 mln tonnes of iron ore in 2016\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"966 Company News - 31am EST UPDATE 1-Brazil's Vale produced record 349 mln tonnes of iron ore in 2016 (Adds production detail) BRASILIA Feb 16 Brazilian miner Vale SA said on Thursday it produced a record 349 million tonnes of iron ore in 2016, above its own guidance, helped by strong performance at mines in northern Brazil and the successful start of its new S11D mine. The world's largest producer of iron ore had forecast that output would be at the lower end of a range of 340-350 million tonnes. Vale said it produced 92.4 million tonnes in the fourth quarter, up 4.5 percent on the same period in 2015. Full-year production rose 1 percent on the previous year. The company said it had continued to halt or reduce higher cost tonnes from its mines in the southeastern state of Minas Gerais, offsetting them with cheaper production from northern Brazil where its costs are lower and quality higher. The S11D mine is Vale's largest ever iron ore project and is located in the Amazon, neighboring the company's other mines in the northern Brazilian state of Para. Guidance for 2017 remained at 360-380 million tonnes, Vale said, adding that by the end of 2018 it expected to reach an annual production rate of 400 million tonnes. Vale reported nickel production of 311,000 tonnes in 2016, 7 percent higher than in 2015 and a company record, after stronger performance at plants in Canada and New Caledonia. (Reporting by Stephen Eisenhammer; editing by Jason Neely and Jane Merriman) Next In Company News\n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "08e8a86e31974829a2daf76a5ff458eb",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=2, description='x', max=2, min=-1), Output()), _dom_classes=('widget-int…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0: Other/Unrelated news, 1: Merger,\n",
"2: Topics related to deals, investments and mergers\n",
"___________________________________________________________________________________________________________\n",
"\n",
"\n"
]
}
],
"source": [
"for index in label_next:\n",
"    show_next(index)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of manual labels in round no. 16:\n",
"0:8, 1:1, 2:1\n",
"Number of articles to be corrected in this round: 4\n"
]
}
],
"source": [
"# count the manual labels of the current round per class\n",
"labels_this_round = df.loc[df['Round'] == m, 'Label']\n",
"print('Number of manual labels in round no. {}:'.format(m))\n",
"print('0:{}, 1:{}, 2:{}'.format(*[(labels_this_round == c).sum() for c in (0, 1, 2)]))\n",
"\n",
"# a correction is a manual label that differs from the estimated label\n",
"corrected = df.loc[(df['Round'] == m) & (df['Label'] != -1) & (df['Estimated'] != -1) & (df['Label'] != df['Estimated'])]\n",
"print('Number of articles to be corrected in this round: {}'.format(len(corrected)))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# save intermediate status\n",
"df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
"          sep='|',\n",
"          mode='w',\n",
"          encoding='utf-8',\n",
"          quoting=csv.QUOTE_NONNUMERIC,\n",
"          quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resubstitution error: Multinomial Naive Bayes ##"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_test = df.loc[df['Label'] != -1, 'Title'] + ' ' + df.loc[df['Label'] != -1, 'Text']\n",
"y_train_test = df.loc[df['Label'] != -1, 'Label']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# discard old indices\n",
"y_train_test = y_train_test.reset_index(drop=True)\n",
"X_train_test = X_train_test.reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use the custom BagOfWords implementation (local module, not scikit-learn)\n",
"stemming = True\n",
"rel_freq = True\n",
"extracted_words = BagOfWords.extract_all_words(X_train_test)\n",
"vocab = BagOfWords.make_vocab(extracted_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build the document-term matrix for the labeled articles\n",
"training_data = BagOfWords.make_matrix(extracted_words, vocab, rel_freq, stemming)\n",
"# resubstitution: evaluate on the same data the classifier is trained on\n",
"testing_data = training_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
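"# Naive Bayes: fit on the labeled articles, then predict the same samples\n",
"classifier = MultinomialNB(alpha=1.0e-10, fit_prior=False, class_prior=None)\n",
"classifier.fit(training_data, y_train_test)\n",
"predictions = classifier.predict(testing_data)"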
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: only needed when inspecting the resubstitution error\n",
"\n",
"n = 0\n",
"for i in range(len(y_train_test)):\n",
"    if y_train_test[i] != predictions[i]:\n",
"        n += 1\n",
"        print('error no.{}'.format(n))\n",
"        print('prediction at index {} is: {}, but actual is: {}'.format(i, predictions[i], y_train_test[i]))\n",
"        print(X_train_test[i])\n",
"        print(y_train_test[i])\n",
"        print()\n",
"if n == 0:\n",
"    print('no resubstitution error :-)')\n",
"else:\n",
"    print('number of wrongly estimated articles: {}'.format(n))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('End of this round (no. {}):'.format(m))\n",
"print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of articles without a manual label: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save this round to csv\n",
"df.to_csv('../data/interactive_labeling_round_{}_neu.csv'.format(m),\n",
"          sep='|',\n",
"          mode='w',\n",
"          encoding='utf-8',\n",
"          quoting=csv.QUOTE_NONNUMERIC,\n",
"          quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW PLEASE CONTINUE WITH PART II.\n",
"REPEAT UNTIL ALL SAMPLES ARE LABELED."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}