# Anne's Bachelor Thesis

State: October 2018 (in progress)

My Python classes for text mining, machine learning models, …

The scripts can be called separately.

The best F1 scores were:

    SVM
    ---
    F1 score: 0.8944166649330559
    best parameters set found on development set:
    {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}

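The keys in the SVM parameter set follow scikit-learn's `<step>__<parameter>` naming for pipelines, which suggests a pipeline with steps named `perc` and `SVC`. As a minimal sketch, the set can be regrouped per step to make it easier to read. The helper name `group_by_step` is hypothetical and not part of the thesis code.

```python
from collections import defaultdict


def group_by_step(params):
    """Group 'step__param' keys by pipeline step name (hypothetical helper)."""
    grouped = defaultdict(dict)
    for key, value in params.items():
        # Split only at the first double underscore: 'SVC__C' -> ('SVC', 'C')
        step, _, name = key.partition("__")
        grouped[step][name] = value
    return dict(grouped)


# Parameter set copied from the SVM result above
best_params = {
    "SVC__C": 0.1,
    "SVC__gamma": 0.01,
    "SVC__kernel": "linear",
    "perc__percentile": 50,
}

grouped = group_by_step(best_params)
for step, values in grouped.items():
    print(step, values)
```

scikit-learn's `SVC` only uses `gamma` for the 'rbf', 'poly' and 'sigmoid' kernels, so with the linear kernel the reported `gamma` value should have no effect on the score.
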
    Naive Bayes
    -----------
    parameters: SelectPercentile(25), own BOW implementation, 10-fold cross validation
    F1 score: min = 0.7586206896551724, max = 0.8846153846153846, average = 0.8324014738144634

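The Naive Bayes result is reported as the minimum, maximum and average F1 score over the folds. The following is a rough sketch of how such a summary can be computed, not the thesis implementation. The function names are hypothetical, the per-fold counts are invented toy values, and only three folds are shown instead of ten.

```python
from statistics import mean


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 if there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(scores):
    """Return min, max and average of the per-fold scores."""
    return {"min": min(scores), "max": max(scores), "average": mean(scores)}


# Toy (true positive, false positive, false negative) counts per fold.
# These numbers are made up for illustration only.
fold_counts = [(8, 2, 2), (9, 1, 2), (7, 3, 1)]

fold_scores = [f1_score(tp, fp, fn) for tp, fp, fn in fold_counts]
summary = summarize(fold_scores)
print(summary)
```
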
The complete documentation can be found in the LaTeX document in the thesis folder.

The CSV file 'classification_labelled_corrected.csv' contains 1497 labeled news articles from Reuters.com and is used for the machine learning models.

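To check that the CSV file loads as expected, a small reader along the following lines may help. This is only a sketch: the comma delimiter, the presence of a header row, and the column names `title` and `label` in the sample are assumptions, since the actual layout of the file is not described here.

```python
import csv
import io


def read_rows(fileobj, delimiter=","):
    """Read a CSV with a header row; return (column names, list of row dicts)."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


# In-memory stand-in for the real file; columns and values are invented.
sample = io.StringIO("title,label\nSome headline,1\nAnother headline,0\n")

columns, rows = read_rows(sample)
print(columns, len(rows))
```

For the real file, pass `open('classification_labelled_corrected.csv', newline='')` instead of the sample. If each article occupies one row, `len(rows)` should come out as 1497.
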
Note:
Please enter a valid personal webhose key before you call 'Requester.py'.
Also, please change the path assigned to the JAVAHOME environment variable in the 'NER.find_companies' method so that it points to your Java installation.

Example:

    import os
    # set paths: point JAVAHOME at the local Java installation
    java_path = "C:\\Program Files (x86)\\Java\\jre1.8.0_181"
    os.environ['JAVAHOME'] = java_path

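Both settings from the note can also be supplied from outside the code instead of being edited into the scripts. The sketch below shows one possible way. The helper names and the `WEBHOSE_API_KEY` variable name are hypothetical and not used by the thesis scripts as they stand.

```python
import os


def set_java_home(java_path):
    """Set JAVAHOME after checking that the directory exists."""
    if not os.path.isdir(java_path):
        raise FileNotFoundError("Java directory not found: " + java_path)
    os.environ["JAVAHOME"] = java_path
    return java_path


def read_api_key(var_name="WEBHOSE_API_KEY"):
    """Read the webhose key from an environment variable (name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError("Please set the environment variable " + var_name)
    return key
```

A call such as `set_java_home("C:\\Program Files (x86)\\Java\\jre1.8.0_181")` fails early with a clear message if the path does not exist on the machine, rather than failing later inside the tagger.
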
## Requirements

    pandas==0.20.1
    nltk==3.2.5
    webhoseio==0.5
    numpy==1.14.0
    graphviz==0.9
    scikit_learn==0.19.2

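To see whether the installed packages match these pins, a check along the following lines can be used. It is a sketch that assumes Python 3.8 or newer (for `importlib.metadata`), which may be newer than the interpreter the thesis code was written for. The function names are hypothetical.

```python
from importlib import metadata  # available from Python 3.8


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (pinned version, installed version or None)."""
    report = {}
    for name, wanted in pins.items():
        try:
            # Looks the distribution up under the name as written in the pin
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


# Pins copied from the requirements list above
PINS = """
pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2
"""

report = check_pins(parse_pins(PINS))
for name, (wanted, installed) in report.items():
    print(name, "pinned:", wanted, "installed:", installed)
```
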
## Installation under Windows

    pip install XY

## Installation under Ubuntu

    apt-get install XX