For a classification problem, a labeled data set is always needed before a machine learning model can be applied and predictions become possible. In general, the larger the labeled data set, the better the predictions. However, to get there, every single data element first has to be classified manually. Depending on the type of data, this procedure can be very time-consuming, for example if long texts have to be read.
In this thesis we present an alternative data labeling method that allows a larger amount of data to be labeled in a shorter time.
\section{Goals}
\label{sec:goals}
\jk{One sentence describing the problem, then broken down into subtasks}
We want to compare a conventional method of data labeling with an alternative, incremental method using the following example: the aim is to investigate news articles about recent mergers ('mergers and acquisitions') and to classify them accordingly. With the help of the labeled data set, different classification models will be applied and optimized so that predictions about future news articles become possible.
As a source for our initial data set, RSS feeds from established business news agencies such as \textit{Reuters} or \textit{Bloomberg} come into consideration. However, when crawling RSS feeds, it is not possible to retrieve news from a longer period in the past. Since we want to analyze news from a period of 12 months, we obtain the data set from the provider \textit{webhose.io}\footnote{\url{https://webhose.io/}}. It offers access to English news articles from sections like \textit{Financial News}, \textit{Finance} and \textit{Business} at affordable fees compared to the news agencies' own offers. As we are only interested in reliable sources, we limit our request to the websites of the news agencies \textit{Reuters, Bloomberg, Financial Times, CNN, The Economist} and \textit{The Guardian}.
\jk{What has to be done overall, which subproblems have to be addressed. Discuss alternatives, make decisions based on criteria. This is where your own work goes, not related work or existing methods. Those are only relevant if you compare against them.}
First, we need to select appropriate data, then label a data set manually, preprocess the articles, select and optimize suitable classification models, and finally recognize the merger partners in the classified articles.\\
\\
% Insert data processing pipeline as a figure:
Data Selection $\rightarrow$ Labeling $\rightarrow$ Preprocessing $\rightarrow$ Model Selection $\rightarrow$ Recognition of Merger Partners
\section{Data Selection}
\label{sec:data_selection}
Before we can start with the data processing, we need to identify and select appropriate data. We downloaded news articles covering 12 months (the year 2017) from \textit{webhose.io} as described in Chapter \ref{chap:implementation}, Section \ref{sec:data_download}.
As \textit{webhose.io} is a secondary source that merely crawls the news feeds itself, some RSS feeds may not be parsed correctly, or an article may be tagged with a wrong topic in its \textit{site categories}. The downloaded files also contain blog entries, user comments, videos, graphical content and other spam, which we have to filter out, as well as pages that merely quote Reuters and other agencies. Besides this, we are only interested in English news articles. \\
After we have filtered out all irrelevant data, we obtain a data set of XX.XXX news articles, which we store in a CSV file.
The CSV file contains the following nine columns:
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
SectionTitle & Title & SiteSection & Text & Uuid & Timestamp & Site & SiteFull & Url \\
\hline
\end{tabular}
\end{center}
Some of these columns are described in more detail below:
\begin{itemize}
\item\textbf{SiteSection:} The link to the section of the site where the thread was created, e.g. \textit{'http://feeds.reuters.com/reuters/financialsNews'}
\item\textbf{Uuid:} Universally unique identifier, representing the article's thread.
\item\textbf{Timestamp:} The thread's publishing date and time in ISO 8601 format including the timezone offset, e.g. \textit{'2018-09-17T20:00:00.000+03:00'}
\item\textbf{Site:} The top level domain of the article's site, e.g. \textit{'reuters.com'}
\item\textbf{SiteFull:} The complete domain of the article's site, e.g. \textit{'reuters.com'}
\item\textbf{Url:} The link to the top of the article's thread, e.g. \textit{'https://www.reuters.com/article/us-github-m-a-microsoft-eu/eu-antitrust-ruling-on-microsoft-buy-of-github-due-by-october-19-idUSKCN1LX114'}
\end{itemize}
The columns \textbf{Title} and \textbf{Text} contain our main data, whereas the remaining attributes are meta data.
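A minimal sketch of this filtering step is given below. It assumes that the downloaded articles have already been collected into a table with the columns listed above (the file name \texttt{all\_articles.csv} is hypothetical); the criteria for spam, blog entries, user comments and quoted content require additional heuristics and are only hinted at here.
\begin{verbatim}
# Simplified sketch of the filtering step (illustration, not the final code).
import pandas as pd

RELIABLE_SITES = ['reuters.com', 'bloomberg.com', 'ft.com',
                  'cnn.com', 'economist.com', 'theguardian.com']

df = pd.read_csv('all_articles.csv')              # hypothetical intermediate file

df = df[df['Site'].isin(RELIABLE_SITES)]          # keep only the selected news sites
df = df[df['Text'].notna()]                       # drop entries without article text
df = df.drop_duplicates(subset=['Title', 'Text']) # drop exact duplicates
# Further heuristics (spam, blogs, user comments, non-English texts, pages
# quoting Reuters) would be applied here before writing the final file.

df.to_csv('cleaned_articles.csv', index=False)
\end{verbatim}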
\section{Labeling}
From our data set of XX.XXX news articles, we select 10,000 articles\footnote{833 or 834 articles from each month} to proceed with the labeling process.
First, we label a slightly smaller data set in a conventional way. This data set consists of 1,497 news articles, which were also downloaded via \textit{webhose.io}. It contains news articles from different Reuters RSS feeds covering a period of one month\footnote{The timeframe is May 25 to June 25, 2018, retrieved on June 25, 2018.}. Here, we keep only those articles that contain at least one of the keywords \textit{'merger', 'acquisition', 'take over', 'deal', 'transaction'} or \textit{'buy'} in the heading.
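To illustrate this filter, the following minimal sketch applies the keyword list to article headings; the list of articles and the second heading are made up solely for this example, and in practice the filter may equally well be expressed directly in the \textit{webhose.io} query.
\begin{verbatim}
# Minimal sketch of the keyword filter on article headings (illustration only).
KEYWORDS = ['merger', 'acquisition', 'take over', 'deal', 'transaction', 'buy']

def is_candidate(title):
    """Return True if the heading contains at least one of the keywords."""
    title = title.lower()
    return any(keyword in title for keyword in KEYWORDS)

# 'articles' stands for the downloaded articles; the entries here are examples.
articles = [{'title': 'EU antitrust ruling on Microsoft buy of GitHub due by October 19'},
            {'title': 'Some unrelated business news'}]
candidates = [a for a in articles if is_candidate(a['title'])]  # keeps only the first entry
\end{verbatim}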
Some article texts were difficult to classify even when read carefully.
Here are a few examples of the difficulties that showed up:
\begin{itemize}
\item\textit{'Company A acquires more than 50\% of the shares of company B.'}\\ $\Rightarrow$ How should share deals be handled? They do mean a change of ownership, even if they are not mergers in the strict sense.
\item\textit{'Company X will buy/wants to buy company Y.'}\\ $\Rightarrow$ Will the merger definitely take place? Which circumstances does it depend on?
\item\textit{'Last year company X and company Y merged. Now company A wants to invest more in renewable energies.'}\\ $\Rightarrow$ The merger is only mentioned in an incidental remark and is not taking place right now; the main topic of the article is something completely different.
\end{itemize}
These difficulties led to the idea of using different labeling classes, which we finally implemented in the interactive labeling method.
For the interactive labeling method, we use the data set of 10,000 articles covering a whole year, as described in Section \ref{sec:data_selection}.
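To make the incremental procedure more concrete, the following sketch shows a single iteration of the labeling loop. It assumes scikit-learn, a hypothetical CSV file with a \texttt{Text} and a partially filled \texttt{Label} column, and exemplary confidence thresholds (80\,\% for the most probable class, at most 10\,\% for all others); the models and thresholds actually used are a separate design decision and may differ from this illustration.
\begin{verbatim}
# Illustrative sketch of one iteration of the incremental labeling loop.
# File name, column names and thresholds are assumptions for this example.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('articles_to_label.csv')   # 'Label' is empty for unlabeled articles
labeled = df[df['Label'].notna()]
unlabeled = df[df['Label'].isna()]

# Build a simple model (e.g. Naive Bayes) on the manually labeled articles.
vectorizer = TfidfVectorizer(stop_words='english')
model = MultinomialNB().fit(vectorizer.fit_transform(labeled['Text']), labeled['Label'])

# Apply the model to all unlabeled articles: one probability vector per article.
proba = model.predict_proba(vectorizer.transform(unlabeled['Text']))
ranked = np.sort(proba, axis=1)
best, second = ranked[:, -1], ranked[:, -2]

# Clear cases: highest class probability above 80 %, all others below 10 %.
clear = (best > 0.8) & (second < 0.1)
df.loc[unlabeled.index[clear], 'Label'] = model.classes_[proba.argmax(axis=1)][clear]

# Unclear cases (several classes with similar probability) are presented to
# the user for manual labeling; afterwards the next iteration starts again.
unclear = unlabeled.index[~clear][np.argsort((best - second)[~clear])][:100]
\end{verbatim}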
%It is likely that we then only have mergers with many articles => this could be minimized with ``stratified'' sampling => first apply NER, then randomize fairly across classes => select 10 articles from 100 categories => select 10 categories => among them, pick one article at random. Label 1\% of all articles
%1) Build first models, e.g. Bayes. Apply them to all articles => probability per class, vector: (K1, K2, ..., K6)
%Clear cases: Kx > 80\% and all other Ky < 10\% (with x in {1-6}, y != x)
%=> adopt the label => how many cases are unambiguous?
%Claim: 10\% of all articles are unambiguous
%Check by sampling => randomly select 10 articles from each class
%Identification of highly unclear cases
%More than one class has a similar probability
%(5\%, 5\%, 5\%, ...) => (80\%, 80\%, 0\%, 0\%, ...)
%e.g. look at 100 articles and label them manually
%=> repeat this 3-4 times, then go back to step 1) (build model)
%=> 95\% of all cases are now clear.
%=> why do the remaining 5\% not work? Inspect a sample of these articles
%If this does not work, improve the models or the preprocessing (e.g. NER)
To retrieve our data, we make the following request on the website
\url{https://webhose.io}:\\\\
\texttt{
site:(reuters.com OR ft.com OR cnn.com OR economist.com\\
\noindent\hspace*{12mm}%
OR bloomberg.com OR theguardian.com)\\
site\_category:(financial\_news OR finance OR business)\\
\\
timeframe: january 2017 - december 2017}\\
\\
The requested data was downloaded in September 2018 in JSON file format. Every news article is saved in a separate file; in total, 1,478,508 files were downloaded (4.69 GiB).
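For illustration, a minimal download sketch is shown below. The endpoint, parameter names and response fields follow the public \textit{webhose.io} REST API as far as we are aware and should be treated as assumptions; the snippet is not the exact download script used for this thesis.
\begin{verbatim}
# Illustrative sketch of the paginated download from webhose.io; endpoint,
# parameter and field names are assumptions, not the exact implementation.
import json
import os
import requests

QUERY = ('site:(reuters.com OR ft.com OR cnn.com OR economist.com '
         'OR bloomberg.com OR theguardian.com) '
         'site_category:(financial_news OR finance OR business)')

url = 'http://webhose.io/filterWebContent'        # assumed REST endpoint
params = {'token': os.environ['WEBHOSE_TOKEN'],   # personal API key (assumed env variable)
          'format': 'json',
          'q': QUERY,
          'ts': '1483228800000'}                  # January 1, 2017 in milliseconds (assumed)

os.makedirs('articles', exist_ok=True)
count = 0
while True:
    response = requests.get(url, params=params).json()
    # Save every news article (post) in a separate JSON file.
    for post in response.get('posts', []):
        with open(f'articles/article_{count}.json', 'w', encoding='utf-8') as f:
            json.dump(post, f)
        count += 1
    # Follow the pagination link until no more results are available.
    if response.get('moreResultsAvailable', 0) <= 0:
        break
    url = 'http://webhose.io' + response['next']  # 'next' already contains token and query
    params = {}
\end{verbatim}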
Among other fields, a JSON file contains the information shown in the following example:\\
"title": "EU antitrust ruling on Microsoft buy of GitHub due by October 19",
"text": "BRUSSELS (Reuters)-EU antitrust regulators will decide by Oct. 19 whether to clear U.S. software giant Microsoft's \$7.5 billion dollar acquisition of privately held coding website GitHub. Microsoft, which wants to acquire the firm to reinforce its cloud computing business against rival Amazon, requested European Union approval for the deal last Friday, a filing on the European Commission website showed on Monday. The EU competition enforcer can either give the green light with or without demanding concessions, or it can open a full-scale investigation if it has serious concerns. GitHub, the world's largest code host with more than 28 million developers using its platform, is Microsoft's largest takeover since the company bought LinkedIn for \$26 billion in 2016. Microsoft Chief Executive Satya Nadella has tried to assuage users' worries that GitHub might favor Microsoft products over competitors after the deal, saying GitHub would continue to be an open platform that works with all the public clouds. Reporting by Foo Yun Chee; Editing by Edmund Blair",