# Prediction of Company Mergers (Bachelor's Thesis, Anne)
This project contains Python classes for text mining, machine learning models, …
The CSV file *classification_labelled_corrected.csv* contains 1497 labelled news articles from *Reuters.com* and is used for the machine learning models.
**Best F1 score results**:
* **Support Vector Machines Classifier (SVM):**
  * F1 score: 0.8944166649330559
  * Best parameter set found on the development set:
    `{'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}`
* **Naive Bayes Classifier**:
  * F1 score: 0.8324014738144634 (average)
  * Parameters: SelectPercentile(25), own Bag of Words implementation, 10-fold cross-validation
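
The parameter names reported for the SVM (`SVC__C`, `perc__percentile`) suggest a scikit-learn pipeline tuned with a grid search. The following is a minimal sketch of such a setup, not the project's actual code: the step name `vect`, the use of `CountVectorizer` and `chi2`, the grid values, and the toy headlines are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the project itself uses the labelled Reuters articles.
texts = [
    "Company A agrees to acquire Company B in a cash deal",
    "Company C announces merger with rival Company D",
    "Company E completes takeover of Company F",
    "Company G to merge with Company H next year",
    "Central bank leaves interest rates unchanged",
    "Oil prices fall as supply rises",
    "Retail sales grow faster than expected",
    "Storm disrupts flights across the region",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = merger-related, 0 = other

# Step names determine the prefixes of the grid keys ("perc__...", "SVC__...").
pipeline = Pipeline([
    ("vect", CountVectorizer()),       # bag-of-words counts (assumed step)
    ("perc", SelectPercentile(chi2)),  # keep the top percentile of features
    ("SVC", SVC()),
])

# Assumed grid; it only needs to contain the values reported above.
param_grid = {
    "perc__percentile": [25, 50, 100],
    "SVC__C": [0.1, 1.0],
    "SVC__gamma": [0.01, 0.1],
    "SVC__kernel": ["linear", "rbf"],
}

# cv=2 only because the toy data set is tiny.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```
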
The complete documentation can be found in the LaTeX document in the *thesis* folder.
## Installation on Windows
```bash
$ pip install xy
```
### Requirements
* pandas==0.20.1
* nltk==3.2.5
* webhoseio==0.5
* numpy==1.14.0
* graphviz==0.9
* scikit_learn==0.19.2
## Usage
The scripts can be called separately.
Before calling *Requester.py*, enter a valid personal key for *webhose.io*.
To run *NER.py*, change the path assigned to the JAVAHOME environment variable in the *find_companies* method.
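
As a rough illustration of the JAVAHOME step, the sketch below sets the variable from Python. It assumes the method assigns it through `os.environ`; the helper name `set_java_home` and the path are hypothetical and must be adapted to the local Java installation.

```python
import os


def set_java_home(path):
    """Point the JAVAHOME environment variable at a local Java executable."""
    os.environ["JAVAHOME"] = path
    return os.environ["JAVAHOME"]


# Hypothetical path; replace it with the location of java.exe on your machine.
java_path = r"C:\Program Files\Java\jdk1.8.0\bin\java.exe"
set_java_home(java_path)
```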
---
**Author:** Anne Lorenz / Datavard AG
**Project Status:** work in progress