julian.kunkel/thesis-anne

Latest commit: c85ce71e24 by Anne Lorenz, "removed csvHandler.py", 2018-10-18 13:57:46 +02:00

File                                    Last commit                          Date
stanford-ner-2018-02-27                 Update                               2018-10-18 10:48:07 +02:00
thesis                                  Update                               2018-10-18 10:48:07 +02:00
.gitignore                              deleted .gitignore                   2018-09-14 09:19:12 +02:00
BagOfWords.py                           Update                               2018-10-18 10:48:07 +02:00
classification_labelled_corrected.csv   added new files                      2018-09-07 14:16:47 +02:00
CosineSimilarity.py                     added requirements and some things   2018-09-17 14:47:50 +02:00
DecisionTree.py                         removed csvHandler.py                2018-10-18 13:57:46 +02:00
FilterKeywords.py                       dict -> defaultdict                  2018-10-18 11:16:19 +02:00
JSONHandler.py                          removed csvHandler.py                2018-10-18 13:57:46 +02:00
NaiveBayes_Interactive.py               removed csvHandler.py                2018-10-18 13:57:46 +02:00
NaiveBayes.py                           removed csvHandler.py                2018-10-18 13:57:46 +02:00
NER.py                                  Update                               2018-10-18 10:48:07 +02:00
README.md                               removed csvHandler.py                2018-10-18 13:57:46 +02:00
Requester.py                            changes                              2018-10-18 12:14:53 +02:00
SVM.py                                  removed csvHandler.py                2018-10-18 13:57:46 +02:00

README.md

Anne's Bachelor Thesis

State: October 2018 (in progress)

My Python classes for text mining, machine learning models, … The scripts can be called separately.

Best F1 score results were:

SVM

F1 score: 0.8944166649330559

Best parameter set found on the development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}
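The keys in the parameter set above look like scikit-learn's `<step>__<parameter>` naming for pipeline parameters, which would mean the search ran over a pipeline with steps named `perc` and `SVC`. That is an inference from the key names, not something confirmed by the code in this repository. The following minimal sketch uses only the standard library to regroup the reported values by step; the helper name `group_by_step` is hypothetical.

```python
def group_by_step(params):
    """Group 'step__parameter' keys into one dict per pipeline step.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    grouped = {}
    for key, value in params.items():
        step, sep, name = key.partition("__")
        if not sep or not name:
            raise ValueError("expected a key of the form 'step__parameter': %r" % key)
        grouped.setdefault(step, {})[name] = value
    return grouped


# Values copied from the SVM result reported above.
best_params = {
    'SVC__C': 0.1,
    'SVC__gamma': 0.01,
    'SVC__kernel': 'linear',
    'perc__percentile': 50,
}

print(group_by_step(best_params))
# {'SVC': {'C': 0.1, 'gamma': 0.01, 'kernel': 'linear'}, 'perc': {'percentile': 50}}
```

Read this way, the best configuration keeps the top 50 percent of features and uses a linear kernel with C = 0.1.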

Naive Bayes

Parameters: SelectPercentile(25), own BOW implementation, 10-fold cross-validation

F1 score: min = 0.7586206896551724, max = 0.8846153846153846, average = 0.8324014738144634
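As a rough illustration of how a min/max/average summary like the one above can be derived from per-fold results, here is a minimal standard-library sketch. The helper names `f1_score` and `summarize` and the two toy folds are assumptions made up for this example; they are not taken from 'NaiveBayes.py' and do not reproduce the reported numbers.

```python
from statistics import mean


def f1_score(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(scores):
    """Minimum, maximum and average of the per-fold scores."""
    return {"min": min(scores), "max": max(scores), "average": mean(scores)}


# Toy (true labels, predicted labels) pairs standing in for two folds.
folds = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 1, 0], [1, 0, 1, 1]),
]
fold_scores = [f1_score(y_true, y_pred) for y_true, y_pred in folds]
print(summarize(fold_scores))
```

With 10-fold cross-validation, `fold_scores` would hold ten values, one per held-out fold.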

The complete documentation can be found in the LaTeX document in the thesis folder.

The CSV file 'classification_labelled_corrected.csv' contains 1497 labeled news articles from Reuters.com and is used for the machine learning models.

Note: Please enter a valid webhose personal key before you call 'Requester.py'. Also, please change the path assigned to the JAVAHOME environment variable in the 'NER.find_companies' method so that it points to your local Java installation.

example:

    # set paths
    import os
    java_path = r"C:\Program Files (x86)\Java\jre1.8.0_181"
    os.environ['JAVAHOME'] = java_path
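A hard-coded path fails silently when the directory does not exist on the current machine. The following minimal sketch checks the path before exporting it. The function name `set_java_home` and its `environ` parameter are hypothetical and not part of 'NER.py'; only the variable name JAVAHOME comes from the note above.

```python
import os


def set_java_home(java_path, environ=os.environ):
    """Export java_path as JAVAHOME after checking that the directory exists.

    Hypothetical helper for illustration; 'environ' defaults to the real
    process environment and can be replaced by a plain dict in tests.
    """
    if not os.path.isdir(java_path):
        raise FileNotFoundError("Java directory not found: %s" % java_path)
    environ['JAVAHOME'] = java_path
    return java_path


# Example call with the Windows path from the README (adjust to your system):
# set_java_home(r"C:\Program Files (x86)\Java\jre1.8.0_181")
```

Failing early with a clear message can make a wrong path easier to spot than an error raised later by the NER step.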

Requirements

pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2

Installation under Windows

pip install XY

Installation under Ubuntu

apt-get install XX