my python classes for text mining, machine learning models, …
Anne Lorenz 7e037a1621 changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
data changes for data exploration 2018-10-26 11:30:19 +02:00
stanford-ner-2018-02-27 Update 2018-10-18 10:48:07 +02:00
thesis changes for data exploration 2018-10-26 11:30:19 +02:00
.gitignore Initial commit 2018-09-13 07:25:35 +00:00
BagOfWords.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
CosineSimilarity.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
DecisionTree.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
FileHandler.py changes for data exploration 2018-10-26 11:30:19 +02:00
FilterKeywords.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
NER.py changes for data exploration 2018-10-26 11:30:19 +02:00
NaiveBayes.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
NaiveBayes_Interactive.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
README.md corrected calculation of precision 2018-10-19 10:28:26 +02:00
Requester.py changes 2018-10-18 12:14:53 +02:00
SVM.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00
VisualizerNews.py changed CountVectorizer optional and other things 2018-11-01 14:03:17 +01:00

README.md

Predictor for Company Mergers

(Bachelor's thesis, Anne Lorenz)

This project contains Python classes for text mining and machine learning models that recognize company mergers in news articles. The CSV file classification_labelled_corrected.csv contains 1497 labeled news articles from Reuters.com and serves as the data set for the machine learning models.

Best F1 score results:

  • Support Vector Machines Classifier (SVM):
    F1 score: 0.894
    Best parameters set found on development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}

  • Naive Bayes Classifier:
    F1 score: 0.841 (average)
    Parameters: SelectPercentile(100), own Bag of Words implementation, 10-fold cross validation
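
For reference, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation in plain Python, not the code used in this repository. The function name `precision_recall_f1`, the toy labels, and the convention that 1 marks a merger article are assumptions made for illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for the positive class.

    Assumption: labels are comparable values and `positive` marks
    the class of interest (here: an article about a merger).
    """
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up labels (not taken from the data set).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In the toy example there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 0.75.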

The complete documentation can be found in the LaTeX document in the thesis folder.

Installation under Windows

Install each package listed under Requirements with pip, replacing xy with the package name:

$ pip install xy

Requirements

pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2

Usage

The scripts can be run separately. Before calling Requester.py, you need to enter a valid personal key for webhose.io. To run NER.py, you need to change the path assigned to the JAVAHOME environment variable in the find_companies method.
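
Setting JAVAHOME from Python might look like the sketch below. It is an illustration only, not the actual code in NER.py: the helper `set_java_home` and the Java path are hypothetical, and the path must be replaced with the location of Java on your machine.

```python
import os

# Hypothetical path: replace with the Java location on your machine.
JAVA_PATH = r"C:\Program Files\Java\jdk1.8.0_181\bin\java.exe"


def set_java_home(java_path):
    """Set the JAVAHOME environment variable for the current process.

    The change only affects this Python process and its children;
    it does not modify the system-wide Windows settings.
    """
    os.environ["JAVAHOME"] = java_path
    return os.environ["JAVAHOME"]


set_java_home(JAVA_PATH)
print(os.environ["JAVAHOME"])
```

Because the variable is set only for the running process, the assignment has to happen before the Stanford NER tagger is started.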


Author: Anne Lorenz / Datavard AG

Project Status: work in progress