
Predictor for Company Mergers

(Bachelor's thesis, Anne Lorenz)

This project contains Python classes for text mining and machine learning models that recognize company mergers in news articles. The CSV file classification_labelled_corrected.csv contains 1497 labeled news articles from Reuters.com and is used to train and evaluate the machine learning models.

Best F1 score results:

  • Support Vector Machines Classifier (SVM):
    F1 score: 0.894
    Best parameters set found on development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}

  • Naive Bayes Classifier:
    F1 score: 0.841 (average)
    Parameters: SelectPercentile(100), own Bag of Words implementation, 10-fold cross-validation
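
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score is derived from true and predicted labels. It is not taken from the project code: the function name `precision_recall_f1`, the label encoding (1 = merger article, 0 = other article), and the toy label lists are assumptions made for illustration only.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    # True positives: labeled positive and predicted positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive although labeled otherwise.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: labeled positive but predicted otherwise.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (assumption: 1 = merger article, 0 = other article).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With 10-fold cross-validation, a score like this would be computed once per fold and then averaged, which is presumably what "(average)" refers to above.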

The complete documentation can be found in the LaTeX document in the thesis folder.

Installation under Windows

$ pip install xy

(Here xy is a placeholder; replace it with the name of each package listed under Requirements.)

Requirements

pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2

Usage

The scripts can be called separately. Before you call Requester.py, you need to enter a valid personal key for webhose.io. To run NER.py, you need to change the path assigned to the JAVAHOME environment variable in the find_companies method.
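
As a rough illustration of the JAVAHOME step, the sketch below sets the environment variable from Python before any Java-based tagger is started. It assumes that NER.py relies on a Java wrapper that reads JAVAHOME at run time; the helper name `set_java_home` and the path shown are hypothetical and must be replaced with the location of your own Java installation.

```python
import os

# Hypothetical path: replace it with the java executable on your machine.
JAVA_PATH = r"C:\Program Files\Java\jre1.8.0_161\bin\java.exe"


def set_java_home(java_path):
    """Point the JAVAHOME environment variable at the given Java binary."""
    os.environ["JAVAHOME"] = java_path
    return os.environ["JAVAHOME"]


# Call this before the named entity recognition step is started.
configured_path = set_java_home(JAVA_PATH)
print("JAVAHOME set to:", configured_path)
```

Setting the variable inside the script only affects the running Python process, so the system-wide environment stays unchanged.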


Author: Anne Lorenz / Datavard AG

Project Status: work in progress