{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Jupyter Notebook for Interactive Labeling\n",
"______\n",
"\n",
"This Jupyter Notebook combines a manual and automated labeling technique.\n",
"It includes a basic implementation of Multinomial Bayes Classifier.\n",
"By calculating estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.\n",
"For multiclass labeling, 3 classes are used.\n",
" \n",
"Please note: User instructions are written in upper-case.\n",
"__________\n",
"Version: 2019-02-28, Anne Lorenz"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import operator\n",
"import pickle\n",
"import random\n",
"\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"import ipywidgets as widgets\n",
"from IPython.core.interactiveshell import InteractiveShell\n",
"from IPython.display import display\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_selection import SelectPercentile\n",
"from sklearn.metrics import recall_score, precision_score, f1_score, make_scorer\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.semi_supervised import label_propagation\n",
"\n",
"from BagOfWords import BagOfWords\n",
"from MNBInteractive import MNBInteractive\n",
"from MultinomialNaiveBayes import MultinomialNaiveBayes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part I: Data preparation\n",
"\n",
"First, we import our data set of 10 000 business news articles from a csv file.\n",
"It contains 833/834 articles of each month of the year 2017.\n",
"For detailed information regarding the data set, please read the full documentation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# initialize random => reproducible sequence\n",
"random.seed(5)\n",
"\n",
"filepath = '../data/cleaned_data_set_without_header.csv'\n",
"\n",
"# set up wider display area\n",
"pd.set_option('display.max_colwidth', -1)\n",
"\n",
"# show full text for print statement\n",
"InteractiveShell.ast_node_interactivity = \"all\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
"In the following, we limit the number of occurences of a certain company name in all labeled articles to 3 to avoid imbalance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE INSERT M MANUALLY IF PROCESS HAS BEEN INTERRUPTED BEFORE."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"m=16"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This round number: 16\n",
"Number of manually labeled articles: 1132\n",
"Number of manually unlabeled articles: 8868\n"
]
}
],
"source": [
"# read current data set from csv\n",
"df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
" sep='|',\n",
" usecols=range(1,13), # drop first column 'unnamed'\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')\n",
"\n",
"# find current iteration/round number\n",
"m = int(df['Round'].max())\n",
"print('This round number: {}'.format(m))\n",
"print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"START:\n",
"\n",
"Building the training data set using stratified sampling:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"m = 0"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# iteration number\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#set_0 = df.loc[(df['Round'] < m) & (df['Label'] == 0)]\n",
"#set_1 = df.loc[(df['Round'] < m) & (df['Label'] == 1)]\n",
"#set_2 = df.loc[(df['Round'] < m) & (df['Label'] == 2)]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"84"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strat_len = min(len(set_0), len(set_1), len(set_2))\n",
"len(set_0)\n",
"len(set_1)\n",
"len(set_2)\n",
"strat_len"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Uuid</th>\n",
" <th>Title</th>\n",
" <th>Text</th>\n",
" <th>Site</th>\n",
" <th>SiteSection</th>\n",
" <th>Url</th>\n",
" <th>Timestamp</th>\n",
" <th>Index</th>\n",
" <th>Round</th>\n",
" <th>Label</th>\n",
" <th>Probability</th>\n",
" <th>Estimated</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9819</th>\n",
" <td>05581b3fad7dde79fb4a44ab11400222423903a7</td>\n",
" <td>CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb</td>\n",
" <td>December 27, 2017 / 9:52 PM / in 8 minutes CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb Reuters Staff (Adds details throughout and updates prices to close) * TSX ends up 37.86 points, or 0.23 percent, at 16,203.13 * Index posts record closing high * Energy climbs 1.6 percent * Canopy Growth Co surges 20.1 percent TORONTO, Dec 27 (Reuters) - Canadas main stock index rose on Wednesday to a record high as a recent rally in commodity prices boosted the energy and materials sectors, while healthcare gained more than 6 percent as shares of marijuana companies jumped. * The Toronto Stock Exchanges S&amp;P/TSX composite index ended up 37.86 points, or 0.23 percent, at 16,203.13, a record closing high. * Energy shares climbed 1.6 percent, with Suncor Energy Inc up 2.5 percent at C$45.82. * The price of U.S. crude oil settled 0.6 percent lower at $59.64 a barrel. But it had touched a 2-1/2-year high in the previous session when the TSX was closed for the Boxing Day holiday. * The materials group, which includes precious and base metals miners and fertilizer companies, added 0.9 percent. * Teck Resources Ltd, which exports steelmaking coal and mines metals, including copper, gained 3.1 percent to C$33.33. * Copper prices advanced 1.3 percent to $7,219 a tonne. * Five of the TSXs 10 main groups ended higher. * Shares of marijuana companies rose after Canadian regulators rejected Aurora Cannabis Incs request to shorten the minimum deposit period to 35 days from 105 days for the hostile takeover of CanniMed Therapeutics Inc. * Aurora Cannabis gained 11.1 percent and CanniMed Therapeutics rose nearly 4 percent, while Canopy Growth Co was the largest percentage gainer on the TSX. It surged 20.1 percent to C$27.77. 
* The largest decliner on the index was Centerra Gold , which plunged 10.3 percent to C$6.50 after the company said mill processing operations at the Mount Milligan Mine in British Columbia have been temporarily suspended due to lack of sufficient water resources. * The heavyweight financials group fell 0.3 percent and technology shares declined 0.6 percent. * Advancing issues outnumbered declining ones on the TSX by 143 to 96, for a 1.49-to-1 ratio on the upside. * The index was posting nine new 52-week highs and one new low. (Reporting by Fergal Smith; Editing by Bill Trott and Meredith Mazzilli)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://feeds.reuters.com/reuters/companyNews</td>\n",
" <td>https://www.reuters.com/article/canada-stocks/canada-stocks-tsx-posts-record-high-as-energy-marijuana-company-shares-climb-idUSL1N1OR169</td>\n",
" <td>2017-12-27T23:48:00.000+02:00</td>\n",
" <td>9819.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>850</th>\n",
" <td>d6970501f4180455abb50f120d4f0c1da7818c2a</td>\n",
" <td>UPDATE 1-Russia's Detsky Mir prices IPO at bottom of range-sources</td>\n",
" <td>(Adds details about demand, background)MOSCOW Feb 8 Russia's largest children's goods retailer Detsky Mir has priced its initial public offering at 85 roubles ($1.43) per share, at the bottom of the 85-87 rouble range, two sources familiar with the deal said on Wednesday.The company saw bids for more than 1.5 times the number of shares on offer, drawing strong demand from foreign investors, said another source, who is close to the placement.More than 30 percent of demand came from U.S. investors, around 35 percent from Europe, less than 10 percent from Russia and more than 25 percent from the Middle East and Asia, he said.The source added there were \"hedge funds, long-only investors including sovereign wealth funds\" among the buyers and that nobody would take a dominant position.The IPO is a test of how quickly investor appetite for Russian assets is recovering, after a three-year period when the economy was buffeted by a slump in oil prices, economic slowdown, and Western sanctions imposed over the conflict in Ukraine.The transaction comprised shares sold by the Sistema conglomerate and the Russia-China Investment Fund.Detsky Mir declined to comment. ($1 = 59.4112 roubles) (Reporting by Maria Kiselyova and Olga Sichkar; Editing by Christian Lowe)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://in.reuters.com/finance/deals</td>\n",
" <td>http://in.reuters.com/article/russia-detsky-mir-ipo-price-idINL5N1FT0DF</td>\n",
" <td>2017-02-08T02:45:00.000+02:00</td>\n",
" <td>850.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3242</th>\n",
" <td>41dfed3520e939bff77261f50947154ace5b0a77</td>\n",
" <td>Takata rescue talks extended, even as bankruptcy risk looms</td>\n",
" <td>By Taiga Uranaka and Maki Shiraki - TOKYO, April 14 TOKYO, April 14 Potential rescuers of Japan's Takata Corp have extended talks, already in their 14th month, for a deal to take over the air bag maker at the heart of the auto industry's biggest safety recall, people briefed on the process said.Car-parts maker Key Safety Systems Inc (KSS) and Bain Capital LLC are the preferred bidder for Takata, whose faulty air bags have been blamed for at least 16 deaths worldwide.Discussions that include the steering committee tapped by the air bag maker to oversee the search for a financial sponsor, automaker clients, suitors and bankers are now likely to run on until at least end-May, three people told Reuters.The parties have already moved beyond an informal, self-imposed end-March deadline to thrash out a deal.Recent talks, described by two participants as chaotic, have focused on issues such as an indemnity agreement to cover reimbursement costs for air bag recalls, estimated to be as high as $10 billion.KSS, a U.S.-based maker of air bags, seatbelts and steering wheels, and Bain, a U.S. private equity fund, are still conducting due diligence, one of those close to the matter said.Another said KSS - which was bought last year by China's Ningbo Joyson Electronic Corp - and Bain plan to offer around 200 billion yen ($1.8 billion) for Takata.A spokesman for Takata and the steering committee declined to comment. A spokeswoman for KSS also declined to comment.Automakers including Honda Motor Co, which have been footing the bill for recalls dating back to 2008, want Takata restructured through a transparent court-ordered process such as bankruptcy, which would wipe out the firm's shareholder value, four automaker sources have told Reuters.\"There's no other option,\" said one automaker executive. \"A privately arranged restructuring would require them to repay billions. 
They can't afford that.\"But Takata, the world's second-biggest air bag maker, is holding out for a private restructuring that would preserve some of the founding Takada familys 60 percent stake.BATTERED REPUTATIONThe clock is ticking for Takata, whose stock has cratered 90 percent since the recall crisis began escalating in early 2014.U.S. federal Judge George Steeh in February cited the potential for Takata to collapse if it couldnt find a buyer.Takata pleaded guilty in Steehs District Court to a felony charge as part of a $1 billion settlement with automakers and victims of its inflators, which can explode with excessive force, blasting shrapnel into passenger areas.The company, which began as a textiles firm and became an early maker of seatbelts, is also trying to settle legal liabilities in the United States, where it faces a class-action lawsuit, and other countries where its air bag inflators have exploded.Takata has denied speculation it would have to seek some form of bankruptcy protection from creditors in the United States or Japan.The company has not been allowed to simply disappear as the auto industry needs it to keep producing the millions of inflators needed to replace recalled air bags - though some automakers have switched to rival suppliers.Also, the government in Tokyo is keen to preserve a major Japanese maker of air bags in a global industry dominated by just three companies.($1 = 108.8300 yen) (Additional reporting by, Taro Fuse, Naomi Tajitsu and Junko Fujita; Editing by William Mallard and Ian Geoghegan)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://in.reuters.com/finance/deals</td>\n",
" <td>http://in.reuters.com/article/takata-restructuring-idINL8N1HA098</td>\n",
" <td>2017-04-14T13:56:00.000+03:00</td>\n",
" <td>3242.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Uuid \\\n",
"9819 05581b3fad7dde79fb4a44ab11400222423903a7 \n",
"850 d6970501f4180455abb50f120d4f0c1da7818c2a \n",
"3242 41dfed3520e939bff77261f50947154ace5b0a77 \n",
"\n",
" Title \\\n",
"9819 CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb \n",
"850 UPDATE 1-Russia's Detsky Mir prices IPO at bottom of range-sources \n",
"3242 Takata rescue talks extended, even as bankruptcy risk looms \n",
"\n",
" Text \\\n",
"9819 December 27, 2017 / 9:52 PM / in 8 minutes CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb Reuters Staff (Adds details throughout and updates prices to close) * TSX ends up 37.86 points, or 0.23 percent, at 16,203.13 * Index posts record closing high * Energy climbs 1.6 percent * Canopy Growth Co surges 20.1 percent TORONTO, Dec 27 (Reuters) - Canadas main stock index rose on Wednesday to a record high as a recent rally in commodity prices boosted the energy and materials sectors, while healthcare gained more than 6 percent as shares of marijuana companies jumped. * The Toronto Stock Exchanges S&P/TSX composite index ended up 37.86 points, or 0.23 percent, at 16,203.13, a record closing high. * Energy shares climbed 1.6 percent, with Suncor Energy Inc up 2.5 percent at C$45.82. * The price of U.S. crude oil settled 0.6 percent lower at $59.64 a barrel. But it had touched a 2-1/2-year high in the previous session when the TSX was closed for the Boxing Day holiday. * The materials group, which includes precious and base metals miners and fertilizer companies, added 0.9 percent. * Teck Resources Ltd, which exports steelmaking coal and mines metals, including copper, gained 3.1 percent to C$33.33. * Copper prices advanced 1.3 percent to $7,219 a tonne. * Five of the TSXs 10 main groups ended higher. * Shares of marijuana companies rose after Canadian regulators rejected Aurora Cannabis Incs request to shorten the minimum deposit period to 35 days from 105 days for the hostile takeover of CanniMed Therapeutics Inc. * Aurora Cannabis gained 11.1 percent and CanniMed Therapeutics rose nearly 4 percent, while Canopy Growth Co was the largest percentage gainer on the TSX. It surged 20.1 percent to C$27.77. 
* The largest decliner on the index was Centerra Gold , which plunged 10.3 percent to C$6.50 after the company said mill processing operations at the Mount Milligan Mine in British Columbia have been temporarily suspended due to lack of sufficient water resources. * The heavyweight financials group fell 0.3 percent and technology shares declined 0.6 percent. * Advancing issues outnumbered declining ones on the TSX by 143 to 96, for a 1.49-to-1 ratio on the upside. * The index was posting nine new 52-week highs and one new low. (Reporting by Fergal Smith; Editing by Bill Trott and Meredith Mazzilli) \n",
"850 (Adds details about demand, background)MOSCOW Feb 8 Russia's largest children's goods retailer Detsky Mir has priced its initial public offering at 85 roubles ($1.43) per share, at the bottom of the 85-87 rouble range, two sources familiar with the deal said on Wednesday.The company saw bids for more than 1.5 times the number of shares on offer, drawing strong demand from foreign investors, said another source, who is close to the placement.More than 30 percent of demand came from U.S. investors, around 35 percent from Europe, less than 10 percent from Russia and more than 25 percent from the Middle East and Asia, he said.The source added there were \"hedge funds, long-only investors including sovereign wealth funds\" among the buyers and that nobody would take a dominant position.The IPO is a test of how quickly investor appetite for Russian assets is recovering, after a three-year period when the economy was buffeted by a slump in oil prices, economic slowdown, and Western sanctions imposed over the conflict in Ukraine.The transaction comprised shares sold by the Sistema conglomerate and the Russia-China Investment Fund.Detsky Mir declined to comment. ($1 = 59.4112 roubles) (Reporting by Maria Kiselyova and Olga Sichkar; Editing by Christian Lowe) \n",
"3242 By Taiga Uranaka and Maki Shiraki - TOKYO, April 14 TOKYO, April 14 Potential rescuers of Japan's Takata Corp have extended talks, already in their 14th month, for a deal to take over the air bag maker at the heart of the auto industry's biggest safety recall, people briefed on the process said.Car-parts maker Key Safety Systems Inc (KSS) and Bain Capital LLC are the preferred bidder for Takata, whose faulty air bags have been blamed for at least 16 deaths worldwide.Discussions that include the steering committee tapped by the air bag maker to oversee the search for a financial sponsor, automaker clients, suitors and bankers are now likely to run on until at least end-May, three people told Reuters.The parties have already moved beyond an informal, self-imposed end-March deadline to thrash out a deal.Recent talks, described by two participants as chaotic, have focused on issues such as an indemnity agreement to cover reimbursement costs for air bag recalls, estimated to be as high as $10 billion.KSS, a U.S.-based maker of air bags, seatbelts and steering wheels, and Bain, a U.S. private equity fund, are still conducting due diligence, one of those close to the matter said.Another said KSS - which was bought last year by China's Ningbo Joyson Electronic Corp - and Bain plan to offer around 200 billion yen ($1.8 billion) for Takata.A spokesman for Takata and the steering committee declined to comment. A spokeswoman for KSS also declined to comment.Automakers including Honda Motor Co, which have been footing the bill for recalls dating back to 2008, want Takata restructured through a transparent court-ordered process such as bankruptcy, which would wipe out the firm's shareholder value, four automaker sources have told Reuters.\"There's no other option,\" said one automaker executive. \"A privately arranged restructuring would require them to repay billions. 
They can't afford that.\"But Takata, the world's second-biggest air bag maker, is holding out for a private restructuring that would preserve some of the founding Takada familys 60 percent stake.BATTERED REPUTATIONThe clock is ticking for Takata, whose stock has cratered 90 percent since the recall crisis began escalating in early 2014.U.S. federal Judge George Steeh in February cited the potential for Takata to collapse if it couldnt find a buyer.Takata pleaded guilty in Steehs District Court to a felony charge as part of a $1 billion settlement with automakers and victims of its inflators, which can explode with excessive force, blasting shrapnel into passenger areas.The company, which began as a textiles firm and became an early maker of seatbelts, is also trying to settle legal liabilities in the United States, where it faces a class-action lawsuit, and other countries where its air bag inflators have exploded.Takata has denied speculation it would have to seek some form of bankruptcy protection from creditors in the United States or Japan.The company has not been allowed to simply disappear as the auto industry needs it to keep producing the millions of inflators needed to replace recalled air bags - though some automakers have switched to rival suppliers.Also, the government in Tokyo is keen to preserve a major Japanese maker of air bags in a global industry dominated by just three companies.($1 = 108.8300 yen) (Additional reporting by, Taro Fuse, Naomi Tajitsu and Junko Fujita; Editing by William Mallard and Ian Geoghegan) \n",
"\n",
" Site SiteSection \\\n",
"9819 reuters.com http://feeds.reuters.com/reuters/companyNews \n",
"850 reuters.com http://in.reuters.com/finance/deals \n",
"3242 reuters.com http://in.reuters.com/finance/deals \n",
"\n",
" Url \\\n",
"9819 https://www.reuters.com/article/canada-stocks/canada-stocks-tsx-posts-record-high-as-energy-marijuana-company-shares-climb-idUSL1N1OR169 \n",
"850 http://in.reuters.com/article/russia-detsky-mir-ipo-price-idINL5N1FT0DF \n",
"3242 http://in.reuters.com/article/takata-restructuring-idINL8N1HA098 \n",
"\n",
" Timestamp Index Round Label Probability \\\n",
"9819 2017-12-27T23:48:00.000+02:00 9819.0 0.0 0.0 NaN \n",
"850 2017-02-08T02:45:00.000+02:00 850.0 0.0 0.0 NaN \n",
"3242 2017-04-14T13:56:00.000+03:00 3242.0 0.0 0.0 NaN \n",
"\n",
" Estimated \n",
"9819 -1.0 \n",
"850 -1.0 \n",
"3242 -1.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Uuid</th>\n",
" <th>Title</th>\n",
" <th>Text</th>\n",
" <th>Site</th>\n",
" <th>SiteSection</th>\n",
" <th>Url</th>\n",
" <th>Timestamp</th>\n",
" <th>Index</th>\n",
" <th>Round</th>\n",
" <th>Label</th>\n",
" <th>Probability</th>\n",
" <th>Estimated</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2730</th>\n",
" <td>9f6bd526a9c9c9305b18b4d121d723ac878012ef</td>\n",
" <td>LVMH's Arnault to take full control of Christian Dior</td>\n",
" <td>By Dominique Vidalon and Gilles Guillaume - PARIS PARIS French billionaire Bernard Arnault will combine the Christian Dior fashion brand with his LVMH luxury goods empire as part of a 12 billion euro ($13 billion) move to simplify his business interests - a restructuring long demanded by other investors.Under a series of complex transactions, LVMH ( LVMH.PA ), the world's largest luxury group, will buy the Christian Dior Couture brand from the Christian Dior ( DIOR.PA ) holding company for 6.5 billion euros, including debt.The deal will unite the 70 year old fashion label worn by film stars from Grace Kelly and Elizabeth Taylor to Jennifer Lawrence and Natalie Portman with the Christian Dior perfume and beauty business already owned by LVMH.The Arnault family, which holds a 47 percent stake in LVMH, will also offer to buy the 25.9 percent of the Christian Dior holding company it does not already own for about 260 euros per share, a premium of 15 percent over Monday's closing price.The transactions \"will allow the simplification of the structures, long requested by the market, and the strengthening of LVMH's Fashion and Leather Goods division,\" the 68-year-old Arnault said in a statement.LVMH shares rose almost 5 percent to a record high of 225 euros as investors welcomed the deals, which they expect to boost LVMH earnings. 
Dior shares also jumped 13 percent to a new high of 256 euros.\"This is a good acquisition for LVMH in our view given the strong brand of Christian Dior, good use of its balance sheet and it reunites the Christian Dior brand with the very profitable perfume operation that LVMH operates,\" Barclays analysts wrote in a research note.LAST BIG DEAL?LVMH said it would use a loan to pay for Christian Dior Couture, which has 198 stores in over 60 countries, and whose sales have doubled over the past five years.Exane BNP Paribas analyst Luca Solca welcomed \"the long awaited LVMH and Dior merger\", which he said was made at a reasonable valuation. Including debt, LVMH is paying 15.6 times Dior's 2017 earnings before interest, taxes, depreciation and amortization (EBITDA).Solca added the deal also reduced the risk of LVMH, whose brands include Louis Vuitton and Hennessy cognac, buying pricey, \"trophy assets\".Finance chief Jean-Jacques Guiony declined to comment on LVMH's future mergers and acquisitions (M&amp;A) policy. But Arnault told the Financial Times that LVMH was not hunting for acquisitions as \"fewer and fewer assets are looking attractive to us. And the best assets are not for sale.\"The Dior holding company owns 41 percent of the LVMH group and 100 percent of Christian Dior Couture, the home of the Lady Dior handbag.Arnault's family company will offer 172 euros per share and 0.192 Hermes ( HRMS.PA ) shares for each Dior holding company share. There are potential all-cash and all-share alternatives.Arnault has a stake of about 8 percent in luxury group Hermes ( HRMS.PA ), and Hermes' shares fell from earlier record highs on the prospect of more of the stock coming to the market.LVMH said the overall deal would boost earnings per share by some 3 percent within the first year of its completion, with the transactions expected to close during the second half of 2017.(Additional reporting by Blandine Henault; Editing by Andrew Callus and Mark Potter)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://feeds.reuters.com/reuters/INbusinessNews</td>\n",
" <td>http://in.reuters.com/article/us-lvmh-dior-idINKBN17R0I1</td>\n",
" <td>2017-04-25T17:50:00.000+03:00</td>\n",
" <td>2730.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4043</th>\n",
" <td>04e51a867745700ec080b4e17618d8c37bfe4122</td>\n",
" <td>EU mergers and takeovers (May 19)</td>\n",
" <td>BRUSSELS May 19 The following are mergers under review by the European Commission and a brief guide to the EU merger process:APPROVALS AND WITHDRAWALS-- U.S. packaging company WestRock to acquire U.S. peer Multi Packaging Solutions (approved May 18)-- Italian cinema operator The Space Cinema, which is controlled by Vue International Holdco Ltd, and Italian peer UCI Italian S.p.A. which is part of Chinese conglomerate Dalian Wanda Group, to set up a joint venture (approved May 18)-- Investment companies TPG and Oaktree to take joint control over Britain's Iona Energy Co, which owns 75 percent of two undeveloped oil fields in the North Sea and that will be active in crude oil production and sale (approved May 18)-- French aircraft engine and aerospace equipment company Safran and China Eastern Airlines Co. Ltd. to form joint venture to provide aircraft maintenance in China (approved May 18)-- Energy company Electricite de France, French state-owned bank Caisse des depots et consignations and Japan's Mitsubishi Corporation to create a joint venture NGM to finance electric mobility projects mainly in France (approved May 18)NEW LISTINGS-- Chinese conglomerate HNA Holding Group Co to acquire Singapore-listed logistics company CWT (notified May 18/deadline June 27/simplified)-- Buyout firm Blackstone and Canada Pension Plan Investment Board (CPPIB) to acquire indirect joint control of U.S. 
educational content provider Ascend Learning (notified May 18/deadline June 27/simplified)EXTENSIONS AND OTHER CHANGESNoneFIRST-STAGE REVIEWS BY DEADLINEMAY 22-- Investment firms Cinven Capital Management and Canada Pension Plan Investment Board to acquire joint control of Travel Holdings Parent Corporation (notified April 10/deadline May 22)MAY 29-- French EDF to acquire equipment and fuel manufacturing company Areva (notified April 18/deadline May 29)MAY 30-- French media group Vivendi to acquire de facto sole control of Italy's Telecom Italia (notified March 31/deadline extended to May 30 from May 12 after Vivendi offered concessions)MAY 31-- Manufacturing and technology company General Electric's Oil &amp; Gas to acquire oilfield services company Baker Hughes (notified April 20/deadline May 31)JUNE 1-- Waste water company SGAB and Spanish infrastructure company Acciona to acquire 10 percent of Sociedad Concesionaria de la Zona Regable del Canal de Navarra (notified April 21/deadline June 1/simplified)JUNE 2-- Australian bank Macquarie and British pension fund Universities Superannuation Scheme to acquire Green Investment Bank (notified April 24/deadline June 2/simplified)JUNE 7-- German company CWS-Boco, which is part of German firm Haniel, to acquire some of British support services firm Rentokil's workwear and hygiene units (notified April 26/deadline June 7)JUNE 8-- German chemicals company Evonik Industries to acquire U.S. company J.M. Huber Corp's silica business (notified April 27/deadline June 8)JUNE 9-- Private equity firm Hellman &amp; Friedman to acquire Spanish logistics platform Allfunds Bank (notified April 28/deadline June 9/simplified)-- U.S. 
smartphone chipmaker Qualcomm to acquire Dutch companyr NXP Semiconductors NV (notified April 28/deadline June 9)-- Chinese textiles company Shanghai Shenda to acquire International Automotive Components Group's trim and acoustics unit business (notified April 24/deadline June 9/simplified)JUNE 12-- American healthcare company Johnson &amp; Johnson to acquire Swiss biotech company Actelion (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)-- Norwegian debt collection agency Nordic Capital, which is majority owned by Nordic Capital Fund VIII and Swedish peer firm Intrum Justitia to merge (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)JUNE 14-- Private equity firms BC Partners and Pollen Street Capital Ltd to jointly acquire UK bank Shawbrook Group plc (notified May 4/deadline June 14/simplified)JUNE 15-- U.S. private equity firm Leonard Green &amp; Partners and the Ontario Municipal Employees Retirement System Primary Pension Plan (OMERS) to acquire joint control of U.S. 
car repairs company OPE Caliber Holdings (notified May 5/deadline June 15/simplified)-- Austrian refractories materials maker RHI to acquire a controlling stake in Brazilian peer Magnesita Refratarios (notified May 5/deadline June 15)JUNE 21-- Investment bank Goldman Sachs and French investment company Eurazeo to jointly acquire Dominion Web Solutions (notified May 12/deadline June 21/simplified)-- French private equity company Ardian France and real estate agent Jones Lang LaSalle Inc to jointly acquire an office building in France (notified May 12/deadline June 21/simplified)-- French minerals company Imerys to acquire French calcium aluminate cements maker Kerneos (notified May 12/deadline June 21)JUNE 22-- German online fashion retailer Zalando and fashion company Bestseller United to set up a joint venture (notified May 15/deadline June 22/simplified)JUNE 26-- Private equity firms Advent International and Bain Capital Investors to jointly acquire payment services company RatePAY (notified May 17/deadline June 26/simplified)-- Private equity firm Oaktree to acquire German nursing care provider Vitanas P&amp;W (notified May 17/deadline June 26/simplified)GUIDE TO EU MERGER PROCESSDEADLINES:The European Commission has 25 working days after a deal is filed for a first-stage review. It may extend that by 10 working days to 35 working days, to consider either a company's proposed remedies or an EU member state's request to handle the case.Most mergers win approval but occasionally the Commission opens a detailed second-stage investigation for up to 90 additional working days, which it may extend to 105 working days.SIMPLIFIED:Under the simplified procedure, the Commission announces the clearance of uncontroversial first-stage mergers without giving any reason for its decision. Cases may be reclassified as non-simplified - that is, ordinary first-stage reviews - until they are approved. (Reporting by Foo Yun Chee)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://in.reuters.com/finance/deals</td>\n",
" <td>http://in.reuters.com/article/eu-ma-idINL8N1IL4K0</td>\n",
" <td>2017-05-19T15:45:00.000+03:00</td>\n",
" <td>4043.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5950</th>\n",
" <td>4dde03a74e2ac98954002bfc960a5f1d60182649</td>\n",
" <td>Canada's OneREIT to be taken private in a C$1.1 billion deal</td>\n",
" <td>August 4, 2017 / 12:57 PM / 3 hours ago Canada's OneREIT to be taken private in a C$1.1 billion deal 1 Min Read (Reuters) - Canada's OneREIT ( ONR_u.TO ) said on Friday it would go private after being bought by SmartREIT and Strathallen Acquisitions Inc in a C$1.1 billion deal, including debt. Under the terms of the deal, shareholders of OneREIT, which owns and operates shopping centers in Canada, will receive C$4.26 per share in cash and SmartREIT unit. The company said it was exploring strategic alternatives earlier this year. Reporting by Ahmed Farhatha in Bengaluru; Editing by Arun Koyyur 0 : 0</td>\n",
" <td>reuters.com</td>\n",
" <td>http://in.reuters.com/finance/deals</td>\n",
" <td>https://in.reuters.com/article/us-onereit-m-a-smartreit-idINKBN1AK1II</td>\n",
" <td>2017-08-04T10:57:00.000+03:00</td>\n",
" <td>5950.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Uuid \\\n",
"2730 9f6bd526a9c9c9305b18b4d121d723ac878012ef \n",
"4043 04e51a867745700ec080b4e17618d8c37bfe4122 \n",
"5950 4dde03a74e2ac98954002bfc960a5f1d60182649 \n",
"\n",
" Title \\\n",
"2730 LVMH's Arnault to take full control of Christian Dior \n",
"4043 EU mergers and takeovers (May 19) \n",
"5950 Canada's OneREIT to be taken private in a C$1.1 billion deal \n",
"\n",
" Text \\\n",
"2730 By Dominique Vidalon and Gilles Guillaume - PARIS PARIS French billionaire Bernard Arnault will combine the Christian Dior fashion brand with his LVMH luxury goods empire as part of a 12 billion euro ($13 billion) move to simplify his business interests - a restructuring long demanded by other investors.Under a series of complex transactions, LVMH ( LVMH.PA ), the world's largest luxury group, will buy the Christian Dior Couture brand from the Christian Dior ( DIOR.PA ) holding company for 6.5 billion euros, including debt.The deal will unite the 70 year old fashion label worn by film stars from Grace Kelly and Elizabeth Taylor to Jennifer Lawrence and Natalie Portman with the Christian Dior perfume and beauty business already owned by LVMH.The Arnault family, which holds a 47 percent stake in LVMH, will also offer to buy the 25.9 percent of the Christian Dior holding company it does not already own for about 260 euros per share, a premium of 15 percent over Monday's closing price.The transactions \"will allow the simplification of the structures, long requested by the market, and the strengthening of LVMH's Fashion and Leather Goods division,\" the 68-year-old Arnault said in a statement.LVMH shares rose almost 5 percent to a record high of 225 euros as investors welcomed the deals, which they expect to boost LVMH earnings. 
Dior shares also jumped 13 percent to a new high of 256 euros.\"This is a good acquisition for LVMH in our view given the strong brand of Christian Dior, good use of its balance sheet and it reunites the Christian Dior brand with the very profitable perfume operation that LVMH operates,\" Barclays analysts wrote in a research note.LAST BIG DEAL?LVMH said it would use a loan to pay for Christian Dior Couture, which has 198 stores in over 60 countries, and whose sales have doubled over the past five years.Exane BNP Paribas analyst Luca Solca welcomed \"the long awaited LVMH and Dior merger\", which he said was made at a reasonable valuation. Including debt, LVMH is paying 15.6 times Dior's 2017 earnings before interest, taxes, depreciation and amortization (EBITDA).Solca added the deal also reduced the risk of LVMH, whose brands include Louis Vuitton and Hennessy cognac, buying pricey, \"trophy assets\".Finance chief Jean-Jacques Guiony declined to comment on LVMH's future mergers and acquisitions (M&A) policy. But Arnault told the Financial Times that LVMH was not hunting for acquisitions as \"fewer and fewer assets are looking attractive to us. And the best assets are not for sale.\"The Dior holding company owns 41 percent of the LVMH group and 100 percent of Christian Dior Couture, the home of the Lady Dior handbag.Arnault's family company will offer 172 euros per share and 0.192 Hermes ( HRMS.PA ) shares for each Dior holding company share. There are potential all-cash and all-share alternatives.Arnault has a stake of about 8 percent in luxury group Hermes ( HRMS.PA ), and Hermes' shares fell from earlier record highs on the prospect of more of the stock coming to the market.LVMH said the overall deal would boost earnings per share by some 3 percent within the first year of its completion, with the transactions expected to close during the second half of 2017.(Additional reporting by Blandine Henault; Editing by Andrew Callus and Mark Potter) \n",
"4043 BRUSSELS May 19 The following are mergers under review by the European Commission and a brief guide to the EU merger process:APPROVALS AND WITHDRAWALS-- U.S. packaging company WestRock to acquire U.S. peer Multi Packaging Solutions (approved May 18)-- Italian cinema operator The Space Cinema, which is controlled by Vue International Holdco Ltd, and Italian peer UCI Italian S.p.A. which is part of Chinese conglomerate Dalian Wanda Group, to set up a joint venture (approved May 18)-- Investment companies TPG and Oaktree to take joint control over Britain's Iona Energy Co, which owns 75 percent of two undeveloped oil fields in the North Sea and that will be active in crude oil production and sale (approved May 18)-- French aircraft engine and aerospace equipment company Safran and China Eastern Airlines Co. Ltd. to form joint venture to provide aircraft maintenance in China (approved May 18)-- Energy company Electricite de France, French state-owned bank Caisse des depots et consignations and Japan's Mitsubishi Corporation to create a joint venture NGM to finance electric mobility projects mainly in France (approved May 18)NEW LISTINGS-- Chinese conglomerate HNA Holding Group Co to acquire Singapore-listed logistics company CWT (notified May 18/deadline June 27/simplified)-- Buyout firm Blackstone and Canada Pension Plan Investment Board (CPPIB) to acquire indirect joint control of U.S. 
educational content provider Ascend Learning (notified May 18/deadline June 27/simplified)EXTENSIONS AND OTHER CHANGESNoneFIRST-STAGE REVIEWS BY DEADLINEMAY 22-- Investment firms Cinven Capital Management and Canada Pension Plan Investment Board to acquire joint control of Travel Holdings Parent Corporation (notified April 10/deadline May 22)MAY 29-- French EDF to acquire equipment and fuel manufacturing company Areva (notified April 18/deadline May 29)MAY 30-- French media group Vivendi to acquire de facto sole control of Italy's Telecom Italia (notified March 31/deadline extended to May 30 from May 12 after Vivendi offered concessions)MAY 31-- Manufacturing and technology company General Electric's Oil & Gas to acquire oilfield services company Baker Hughes (notified April 20/deadline May 31)JUNE 1-- Waste water company SGAB and Spanish infrastructure company Acciona to acquire 10 percent of Sociedad Concesionaria de la Zona Regable del Canal de Navarra (notified April 21/deadline June 1/simplified)JUNE 2-- Australian bank Macquarie and British pension fund Universities Superannuation Scheme to acquire Green Investment Bank (notified April 24/deadline June 2/simplified)JUNE 7-- German company CWS-Boco, which is part of German firm Haniel, to acquire some of British support services firm Rentokil's workwear and hygiene units (notified April 26/deadline June 7)JUNE 8-- German chemicals company Evonik Industries to acquire U.S. company J.M. Huber Corp's silica business (notified April 27/deadline June 8)JUNE 9-- Private equity firm Hellman & Friedman to acquire Spanish logistics platform Allfunds Bank (notified April 28/deadline June 9/simplified)-- U.S. 
smartphone chipmaker Qualcomm to acquire Dutch companyr NXP Semiconductors NV (notified April 28/deadline June 9)-- Chinese textiles company Shanghai Shenda to acquire International Automotive Components Group's trim and acoustics unit business (notified April 24/deadline June 9/simplified)JUNE 12-- American healthcare company Johnson & Johnson to acquire Swiss biotech company Actelion (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)-- Norwegian debt collection agency Nordic Capital, which is majority owned by Nordic Capital Fund VIII and Swedish peer firm Intrum Justitia to merge (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)JUNE 14-- Private equity firms BC Partners and Pollen Street Capital Ltd to jointly acquire UK bank Shawbrook Group plc (notified May 4/deadline June 14/simplified)JUNE 15-- U.S. private equity firm Leonard Green & Partners and the Ontario Municipal Employees Retirement System Primary Pension Plan (OMERS) to acquire joint control of U.S. 
car repairs company OPE Caliber Holdings (notified May 5/deadline June 15/simplified)-- Austrian refractories materials maker RHI to acquire a controlling stake in Brazilian peer Magnesita Refratarios (notified May 5/deadline June 15)JUNE 21-- Investment bank Goldman Sachs and French investment company Eurazeo to jointly acquire Dominion Web Solutions (notified May 12/deadline June 21/simplified)-- French private equity company Ardian France and real estate agent Jones Lang LaSalle Inc to jointly acquire an office building in France (notified May 12/deadline June 21/simplified)-- French minerals company Imerys to acquire French calcium aluminate cements maker Kerneos (notified May 12/deadline June 21)JUNE 22-- German online fashion retailer Zalando and fashion company Bestseller United to set up a joint venture (notified May 15/deadline June 22/simplified)JUNE 26-- Private equity firms Advent International and Bain Capital Investors to jointly acquire payment services company RatePAY (notified May 17/deadline June 26/simplified)-- Private equity firm Oaktree to acquire German nursing care provider Vitanas P&W (notified May 17/deadline June 26/simplified)GUIDE TO EU MERGER PROCESSDEADLINES:The European Commission has 25 working days after a deal is filed for a first-stage review. It may extend that by 10 working days to 35 working days, to consider either a company's proposed remedies or an EU member state's request to handle the case.Most mergers win approval but occasionally the Commission opens a detailed second-stage investigation for up to 90 additional working days, which it may extend to 105 working days.SIMPLIFIED:Under the simplified procedure, the Commission announces the clearance of uncontroversial first-stage mergers without giving any reason for its decision. Cases may be reclassified as non-simplified - that is, ordinary first-stage reviews - until they are approved. (Reporting by Foo Yun Chee) \n",
"5950 August 4, 2017 / 12:57 PM / 3 hours ago Canada's OneREIT to be taken private in a C$1.1 billion deal 1 Min Read (Reuters) - Canada's OneREIT ( ONR_u.TO ) said on Friday it would go private after being bought by SmartREIT and Strathallen Acquisitions Inc in a C$1.1 billion deal, including debt. Under the terms of the deal, shareholders of OneREIT, which owns and operates shopping centers in Canada, will receive C$4.26 per share in cash and SmartREIT unit. The company said it was exploring strategic alternatives earlier this year. Reporting by Ahmed Farhatha in Bengaluru; Editing by Arun Koyyur 0 : 0 \n",
"\n",
" Site SiteSection \\\n",
"2730 reuters.com http://feeds.reuters.com/reuters/INbusinessNews \n",
"4043 reuters.com http://in.reuters.com/finance/deals \n",
"5950 reuters.com http://in.reuters.com/finance/deals \n",
"\n",
" Url \\\n",
"2730 http://in.reuters.com/article/us-lvmh-dior-idINKBN17R0I1 \n",
"4043 http://in.reuters.com/article/eu-ma-idINL8N1IL4K0 \n",
"5950 https://in.reuters.com/article/us-onereit-m-a-smartreit-idINKBN1AK1II \n",
"\n",
" Timestamp Index Round Label Probability \\\n",
"2730 2017-04-25T17:50:00.000+03:00 2730.0 0.0 1.0 NaN \n",
"4043 2017-05-19T15:45:00.000+03:00 4043.0 0.0 1.0 NaN \n",
"5950 2017-08-04T10:57:00.000+03:00 5950.0 0.0 1.0 NaN \n",
"\n",
" Estimated \n",
"2730 -1.0 \n",
"4043 -1.0 \n",
"5950 -1.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Uuid</th>\n",
" <th>Title</th>\n",
" <th>Text</th>\n",
" <th>Site</th>\n",
" <th>SiteSection</th>\n",
" <th>Url</th>\n",
" <th>Timestamp</th>\n",
" <th>Index</th>\n",
" <th>Round</th>\n",
" <th>Label</th>\n",
" <th>Probability</th>\n",
" <th>Estimated</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4352</th>\n",
" <td>752df0da1a3ecf70f57c017b798b2ceff37f557a</td>\n",
" <td>FTC to advise blocking Walgreens deal to buy Rite Aid - CNBC</td>\n",
" <td>June 9 Regulatory authorities are set to advise blocking U.S. drugstore chain Walgreens Boots Alliance Inc's deal to buy smaller rival Rite Aid Corp, CNBC reported on Friday, citing a report.The companies have been waiting for a year-and-a-half for approval from the Federal Trade Commission (FTC) since the initial offer made in 2015.In that time, the closing date of the deal has been postponed repeatedly and the offer price reduced to $6.50 to $7.00 per Rite Aid share, down from $9.The deal would have helped Walgreens widen its U.S. footprint and negotiate for lower drug costs. (Reporting by Sruthi Ramakrishnan in Bengaluru; Editing by Shounak Dasgupta)</td>\n",
" <td>reuters.com</td>\n",
" <td>http://in.reuters.com/finance/deals</td>\n",
" <td>http://in.reuters.com/article/rite-aid-ma-walgreens-boots-idINL3N1J64ON</td>\n",
" <td>2017-06-09T14:48:00.000+03:00</td>\n",
" <td>4352.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1672</th>\n",
" <td>476ed99eb49bd466258784067ac62dd7815e05ab</td>\n",
" <td>UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln</td>\n",
" <td>Company News - Mon Mar 20, 2017 - 5:01am EDT UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln (Adds details, background, share movement) March 20 Britain's Hansteen Holdings has agreed to sell its German and Dutch industrial property portfolios for 1.28 billion euros ($1.38 billion) to a venture between Blackstone Group LP and M7 Real Estate. The price represents a premium of about 6 percent, or roughly 76 million euros, to the assets' valuations at the end of 2016, Hansteen said in a statement on Monday. Hansteen's shares rose more than 6 percent, before paring gains to trade up 3 percent at 125.55 pence at 0850 GMT. They were the top gainers on London's midcap index. \"This is a compelling opportunity to crystallise both the revaluation gains from these German and Dutch assets achieved by our active asset management and the gains from foreign exchange movements,\" Hansteen joint chief executives Morgan Jones and Ian Watson said. Last year, the industrial market outperformed all other European real estate sectors, including offices and retail, data from property consultant CBRE showed, as the sector benefited from higher demand for warehouses from retailers expanding their online operations. Over the fourth quarter, European commercial real estate deals reached a record high of 86.8 billion euros, boosted largely by a buoyant Germany market and growth in the Netherlands, according to the data. Hansteen, a UK real estate investment trust, said that the sale was expected to complete before the end of June and that it was advised by property consultant JLL. The sale leaves Hansteen with its UK business, where the market has seen some turbulence after Britain voted to leave the European Union. However, Hansteen said it had not noticed any significant effect on demand for industrial space following the June 23 vote. 
\"Across the UK, we are experiencing pockets of rental growth and shorter incentives being offered to tenants as demand intensifies,\" the company said. ($1 = 0.9288 euros) (Reporting by Esha Vaish in Bengaluru; Editing by Jason Neely and Alexander Smith) Next In Company News</td>\n",
" <td>reuters.com</td>\n",
" <td>http://feeds.reuters.com/reuters/companyNews</td>\n",
" <td>http://www.reuters.com/article/hansteen-divestiture-idUSL5N1GX0ZS</td>\n",
" <td>2017-03-20T16:01:00.000+02:00</td>\n",
" <td>1672.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5052</th>\n",
" <td>370eed1e30c0117362432816475315a18ab67463</td>\n",
" <td>Italy could consider taking small stake in Alitalia - minister</td>\n",
" <td>July 19, 2017 / 1:38 PM / 8 minutes ago Italy could consider taking small stake in Alitalia - minister Reuters Staff 1 Min Read People walk in the Alitalia departure hall during a strike by Italy's national airline Alitalia workers at Fiumicino international airport in Rome, Italy July 24, 2015. Max Rossi - RTX1LMNG ROME (Reuters) - Italy could consider taking a small stake in struggling airline Alitalia, Transport Minister Graziano Delrio said on Wednesday. \"We are against nationalising (the airline) but the state taking a small stake could be a solution,\" Delrio told a parliamentary commission. He added that the special administrators appointed to run Alitalia after it filed for bankruptcy could stay in their roles for longer than originally planned. Reporting by Alberto Sisto, writing by Isla Binnie 0 : 0</td>\n",
" <td>reuters.com</td>\n",
" <td>http://feeds.reuters.com/Reuters/UKBusinessNews?format=xml</td>\n",
" <td>http://uk.reuters.com/article/uk-italy-alitalia-idUKKBN1A41GI</td>\n",
" <td>2017-07-19T16:38:00.000+03:00</td>\n",
" <td>5052.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>-1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Uuid \\\n",
"4352 752df0da1a3ecf70f57c017b798b2ceff37f557a \n",
"1672 476ed99eb49bd466258784067ac62dd7815e05ab \n",
"5052 370eed1e30c0117362432816475315a18ab67463 \n",
"\n",
" Title \\\n",
"4352 FTC to advise blocking Walgreens deal to buy Rite Aid - CNBC \n",
"1672 UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln \n",
"5052 Italy could consider taking small stake in Alitalia - minister \n",
"\n",
" Text \\\n",
"4352 June 9 Regulatory authorities are set to advise blocking U.S. drugstore chain Walgreens Boots Alliance Inc's deal to buy smaller rival Rite Aid Corp, CNBC reported on Friday, citing a report.The companies have been waiting for a year-and-a-half for approval from the Federal Trade Commission (FTC) since the initial offer made in 2015.In that time, the closing date of the deal has been postponed repeatedly and the offer price reduced to $6.50 to $7.00 per Rite Aid share, down from $9.The deal would have helped Walgreens widen its U.S. footprint and negotiate for lower drug costs. (Reporting by Sruthi Ramakrishnan in Bengaluru; Editing by Shounak Dasgupta) \n",
"1672 Company News - Mon Mar 20, 2017 - 5:01am EDT UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln (Adds details, background, share movement) March 20 Britain's Hansteen Holdings has agreed to sell its German and Dutch industrial property portfolios for 1.28 billion euros ($1.38 billion) to a venture between Blackstone Group LP and M7 Real Estate. The price represents a premium of about 6 percent, or roughly 76 million euros, to the assets' valuations at the end of 2016, Hansteen said in a statement on Monday. Hansteen's shares rose more than 6 percent, before paring gains to trade up 3 percent at 125.55 pence at 0850 GMT. They were the top gainers on London's midcap index. \"This is a compelling opportunity to crystallise both the revaluation gains from these German and Dutch assets achieved by our active asset management and the gains from foreign exchange movements,\" Hansteen joint chief executives Morgan Jones and Ian Watson said. Last year, the industrial market outperformed all other European real estate sectors, including offices and retail, data from property consultant CBRE showed, as the sector benefited from higher demand for warehouses from retailers expanding their online operations. Over the fourth quarter, European commercial real estate deals reached a record high of 86.8 billion euros, boosted largely by a buoyant Germany market and growth in the Netherlands, according to the data. Hansteen, a UK real estate investment trust, said that the sale was expected to complete before the end of June and that it was advised by property consultant JLL. The sale leaves Hansteen with its UK business, where the market has seen some turbulence after Britain voted to leave the European Union. However, Hansteen said it had not noticed any significant effect on demand for industrial space following the June 23 vote. 
\"Across the UK, we are experiencing pockets of rental growth and shorter incentives being offered to tenants as demand intensifies,\" the company said. ($1 = 0.9288 euros) (Reporting by Esha Vaish in Bengaluru; Editing by Jason Neely and Alexander Smith) Next In Company News \n",
"5052 July 19, 2017 / 1:38 PM / 8 minutes ago Italy could consider taking small stake in Alitalia - minister Reuters Staff 1 Min Read People walk in the Alitalia departure hall during a strike by Italy's national airline Alitalia workers at Fiumicino international airport in Rome, Italy July 24, 2015. Max Rossi - RTX1LMNG ROME (Reuters) - Italy could consider taking a small stake in struggling airline Alitalia, Transport Minister Graziano Delrio said on Wednesday. \"We are against nationalising (the airline) but the state taking a small stake could be a solution,\" Delrio told a parliamentary commission. He added that the special administrators appointed to run Alitalia after it filed for bankruptcy could stay in their roles for longer than originally planned. Reporting by Alberto Sisto, writing by Isla Binnie 0 : 0 \n",
"\n",
" Site SiteSection \\\n",
"4352 reuters.com http://in.reuters.com/finance/deals \n",
"1672 reuters.com http://feeds.reuters.com/reuters/companyNews \n",
"5052 reuters.com http://feeds.reuters.com/Reuters/UKBusinessNews?format=xml \n",
"\n",
" Url \\\n",
"4352 http://in.reuters.com/article/rite-aid-ma-walgreens-boots-idINL3N1J64ON \n",
"1672 http://www.reuters.com/article/hansteen-divestiture-idUSL5N1GX0ZS \n",
"5052 http://uk.reuters.com/article/uk-italy-alitalia-idUKKBN1A41GI \n",
"\n",
" Timestamp Index Round Label Probability \\\n",
"4352 2017-06-09T14:48:00.000+03:00 4352.0 0.0 2.0 NaN \n",
"1672 2017-03-20T16:01:00.000+02:00 1672.0 0.0 2.0 NaN \n",
"5052 2017-07-19T16:38:00.000+03:00 5052.0 0.0 2.0 NaN \n",
"\n",
" Estimated \n",
"4352 -1.0 \n",
"1672 -1.0 \n",
"5052 -1.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new for CV: \n",
"#training_data = df.loc[(df['Round'] < m)]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[9819.0, 850.0, 3242.0, 2730.0, 4043.0, 5950.0, 4352.0, 1672.0, 5052.0]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# testing data: manually labeled articles of current round\n",
"testing_data = df.loc[(df['Round'] == m)]\n",
"len(testing_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# classical model fitting:\n",
"\n",
"# use all labeled articles as training samples\n",
"#training_data = pd.concat([set_0, set_1, set_2])\n",
"\n",
"#m += 1\n",
"#m\n",
"\n",
"# classical model fitting:\n",
"#all_labeled_data = df.loc[(df['Round'] <= m)].reset_index()\n",
"#recall_scores, precision_scores, f1_scores = MultinomialNaiveBayes.make_mnb(all_labeled_data)\n",
"#recall_score = sum(recall_scores)/len(recall_scores)\n",
"#print('recall: {}'.format(recall_score))\n",
"#precision_score = sum(precision_scores)/len(precision_scores)\n",
"#print('precision: {}'.format(precision_score))\n",
"#f1_score = sum(f1_scores)/len(f1_scores)\n",
"#print('f1 score: {}'.format(f1_score))\n",
"\n",
"# stratified sampled:\n",
"#training_data = pd.concat([set_0[:strat], set_1[:strat], set_2[:strat]])\n",
"\n",
"#len(training_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now check (and correct if necessary) the next auto-labeled articles."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multinomial Naive Bayes Classification: ##"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use sklearn's CountVectorizer\n",
"# cv = False\n",
"cv = True\n",
"\n",
"# call script with manually labeled and manually unlabeled samples\n",
"%time classes, class_count, class_probs = MNBInteractive.estimate_mnb(training_data, testing_data, cv)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# annotate highest estimated probability for every instance\n",
"maxima = []\n",
"\n",
"for row in class_probs:\n",
" maxima.append(np.amax(row))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#maxima"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the highest estimated probabilities (optional; uncomment both lines)\n",
"#with open('../obj/'+ 'array_class_probs_round_{}_stratified'.format(m) + '.pkl', 'wb') as f:\n",
"#    pickle.dump(maxima, f, pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# sort list in descending order\n",
"maxima.sort(reverse=True)\n",
"\n",
"# convert list to array\n",
"probas = np.asarray(maxima)\n",
"\n",
"n_bins = 50\n",
"\n",
"fig, ax = plt.subplots(figsize=(8, 4))\n",
"\n",
"# plot the cumulative histogram\n",
"n, bins, patches = ax.hist(probas, n_bins, density=1, histtype='step',\n",
" cumulative=True, facecolor='darkred')\n",
"\n",
"ax.grid(True)\n",
"#ax.set_title('Cumulative distribution of highest estimated probability')\n",
"ax.set_xlabel('Highest estimated probability')\n",
"ax.set_ylabel('Fraction of articles with at most this highest estimated probability')\n",
"#plt.axis([0.5, 1, 0, 0.02])\n",
"#ax.set_xbound(lower=0.5, upper=0.99)\n",
"plt.show()"
]
},
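{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read as a sketch (the notation is assumed here, not taken from the original code): with $p_d$ the highest estimated class probability of article $d$ and $N$ the number of estimated articles, the cumulative histogram above approximates, up to the binning into `n_bins` intervals,\n",
"\n",
"$$F(p) = \\frac{1}{N} \\, \\bigl|\\{\\, d : p_d \\le p \\,\\}\\bigr|.$$"
]
},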
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save via the figure handle; plt.savefig() after plt.show() would write an empty figure\n",
"fig.savefig('../visualization/proba_after_round_{}_stratified.png'.format(m))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We annotate each article's estimated class with its probability in columns 'Estimated' and 'Probability':"
]
},
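{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the rule applied in the next cell (the notation is ours, not taken from `MNBInteractive`): if $P(c \\mid d)$ denotes the entry of `class_probs` for article $d$ and class $c$, the two columns are filled as\n",
"\n",
"$$\\text{Estimated}(d) = \\arg\\max_{c} P(c \\mid d), \\qquad \\text{Probability}(d) = \\max_{c} P(c \\mid d).$$"
]
},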
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# list of indices of recently estimated articles\n",
"indices_estimated = df.loc[df['Round'] == m, 'Index'].tolist()\n",
"\n",
"n = 0 \n",
"for row in class_probs:\n",
" for i in range(0, len(classes)):\n",
" index = indices_estimated[n]\n",
" # save estimated label\n",
" if np.amax(row) == row[i]:\n",
" df.loc[index, 'Estimated'] = classes[i]\n",
" # annotate probability\n",
" df.loc[index, 'Probability'] = row[i]\n",
" n += 1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"m = 16"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"83.33333333333334"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"62.5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"60.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"33.33333333333333"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"100.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"80.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"80.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"###############\n"
]
},
{
"data": {
"text/plain": [
"38.88888888888889"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"54.166666666666664"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"73.33333333333333"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
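"# confusion counts for round m, named <estimated>_<label>:\n",
"# e.g. zero_1 = articles estimated as class 0 whose manual label is 1\n",
"print('###############')\n",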
"zero_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 0)])\n",
"zero_0\n",
"zero_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 1)])\n",
"zero_1\n",
"zero_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 2)])\n",
"zero_2\n",
"print('###############')\n",
"one_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 0)])\n",
"one_0\n",
"one_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 1)])\n",
"one_1\n",
"one_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 2)])\n",
"one_2\n",
"print('###############')\n",
"\n",
"two_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 0)])\n",
"two_0\n",
"two_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 1)])\n",
"two_1\n",
"two_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 2)])\n",
"two_2\n",
"print('###############')\n",
"\n",
"total = zero_0 + zero_1 + zero_2 + one_0 + one_1 + one_2 + two_0 + two_1 + two_2\n",
"\n",
"# one-vs-rest counts per class: true/false positives and negatives\n",
"tp_0 = zero_0\n",
"tp_0\n",
"tn_0 = one_1 + one_2 + two_1 + two_2\n",
"tn_0\n",
"fp_0 = zero_1 + zero_2\n",
"fp_0\n",
"fn_0 = one_0 + two_0\n",
"fn_0\n",
"print('###############')\n",
"\n",
"tp_1 = one_1\n",
"tp_1\n",
"tn_1 = zero_0 + zero_2 + two_0 + two_2\n",
"tn_1\n",
"fp_1 = one_0 + one_2\n",
"fp_1\n",
"fn_1 = zero_1 + two_1\n",
"fn_1\n",
"print('###############')\n",
"\n",
"tp_2 = two_2\n",
"tp_2\n",
"tn_2 = zero_0 + zero_1 + one_0 + one_1\n",
"tn_2\n",
"fp_2 = two_0 + two_1\n",
"fp_2\n",
"fn_2 = zero_2 + one_2\n",
"fn_2\n",
"print('###############')\n",
"\n",
"prec_0 = tp_0 / (tp_0 + fp_0) * 100\n",
"prec_0\n",
"rec_0 = tp_0 / (tp_0 + fn_0) * 100\n",
"rec_0\n",
"acc_0 = (tp_0 + tn_0) / total * 100\n",
"acc_0\n",
"print('###############')\n",
"\n",
"prec_1 = tp_1 / (tp_1 + fp_1) * 100\n",
"prec_1\n",
"rec_1 = tp_1 / (tp_1 + fn_1) * 100\n",
"rec_1\n",
"acc_1 = (tp_1 + tn_1) / total * 100\n",
"acc_1\n",
"print('###############')\n",
"\n",
"prec_2 = tp_2 / (tp_2 + fp_2) * 100\n",
"prec_2\n",
"rec_2 = tp_2 / (tp_2 + fn_2) * 100\n",
"rec_2\n",
"acc_2 = (tp_2 + tn_2) / total * 100\n",
"acc_2\n",
"print('###############')\n",
"\n",
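"# macro averages over the three classes; acc_k is a one-vs-rest accuracy,\n",
"# so its mean differs from the overall accuracy (zero_0 + one_1 + two_2) / total\n",
"(prec_1 + prec_2 + prec_0) / 3\n",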
"(rec_1 + rec_2 + rec_0) / 3\n",
"(acc_1 + acc_2 + acc_0) / 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save round\n",
"df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#df.loc[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manual Labeling: ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find new threshold for labeling:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"threshold = 0.9997\n",
"\n",
"# count articles whose highest class probability is below the threshold\n",
"n = sum(1 for p in maxima if p < threshold)\n",
"n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Number of articles with estimated probability < {}: {}'.format(threshold, len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold)])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check articles with probability under threshold:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pick unlabeled articles with P < threshold:\n",
"label_next = df.loc[(df['Label'] == -1) & (df['Probability'] < threshold), 'Index'].tolist()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(label_next)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Estimated labels to be checked: class 0: {}, class 1: {}, class 2: {}'\n",
" .format(len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 0.0)]), \n",
" len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 1.0)]),\n",
" len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 2.0)])))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# increment round number\n",
"m += 1\n",
"print('This round number: {}'.format(m))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"for index in label_next:\n",
" show_next(index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#df.loc[(df['Round'] == m) & (df['Index'].isin(label_next)), 'Round'] = m"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Number of manual labels in round no. {}:'.format(m))\n",
"print('0:{}, 1:{}, 2:{}'.format(len(df.loc[(df['Label'] == 0) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 1) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 2) & (df['Round'] == m)])))\n",
"\n",
"print('Number of articles to be corrected in this round: {}'.format(len(df.loc[(df['Label'] != -1) & (df['Estimated'] != -1) & (df['Round'] == m) & (df['Label'] != df['Estimated'])])))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save intermediate status\n",
"df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resubstitution error: Multinomial Naive Bayes ##"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_test = df.loc[df['Label'] != -1, 'Title'] + ' ' + df.loc[df['Label'] != -1, 'Text']\n",
"y_train_test = df.loc[df['Label'] != -1, 'Label']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# discard old indices\n",
"y_train_test = y_train_test.reset_index(drop=True)\n",
"X_train_test = X_train_test.reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use my own BagOfWords python implementation\n",
"stemming = True\n",
"rel_freq = True\n",
"extracted_words = BagOfWords.extract_all_words(X_train_test)\n",
"vocab = BagOfWords.make_vocab(extracted_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build the document-term matrix; for the resubstitution error the\n",
"# test data is the training data itself\n",
"training_data = BagOfWords.make_matrix(extracted_words, vocab, rel_freq, stemming)\n",
"testing_data = training_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
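"# Naive Bayes: fit on the labeled data and predict the same data (resubstitution)\n",
"classifier = MultinomialNB(alpha=1.0e-10, fit_prior=False, class_prior=None)\n",
"classifier.fit(training_data, y_train_test)\n",
"predictions = classifier.predict(testing_data)"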
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional, only for the resubstitution error\n",
"\n",
"n = 0\n",
"for i in range(len(y_train_test)):\n",
"    if y_train_test[i] != predictions[i]:\n",
"        n += 1\n",
"        print('error no.{}'.format(n))\n",
"        print('prediction at index {} is: {}, but actual is: {}'.format(i, predictions[i], y_train_test[i]))\n",
"        print(X_train_test[i])\n",
"        print(y_train_test[i])\n",
"        print()\n",
"if n == 0:\n",
"    print('no resubstitution error :-)')\n",
"else:\n",
"    print('number of wrongly estimated articles: {}'.format(n))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('End of this round (no. {}):'.format(m))\n",
"print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of articles still unlabeled: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save this round to csv\n",
"df.to_csv('../data/interactive_labeling_round_{}_neu.csv'.format(m),\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW PLEASE CONTINUE WITH PART II.\n",
"REPEAT UNTIL ALL SAMPLES ARE LABELED."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}