Author: Imad Hamoumi
1- Put your data into the directory /data.
2- Start the script with python run.py
3- Follow the on-screen instructions.
Note:
CSV:
+ Only two file extensions are currently allowed: csv and pdf. CSV files are read with pandas.
+ You have to provide the name of the column from which the script reads the text data.
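As a rough illustration of the CSV notes above, reading one text column with pandas could look like the sketch below. The function name read_text_column, the column name "text" and the in-memory sample are assumptions made for this example; they are not taken from run.py.

```python
import io

import pandas as pd


def read_text_column(csv_source, column):
    """Return the non-empty entries of one CSV column as a list of strings.

    Hypothetical helper for illustration; run.py may do this differently.
    """
    frame = pd.read_csv(csv_source)
    if column not in frame.columns:
        raise KeyError(
            f"Column {column!r} not found; available columns: {list(frame.columns)}"
        )
    # Drop missing rows and make sure every entry is a string.
    return frame[column].dropna().astype(str).tolist()


# In-memory sample standing in for a CSV file placed under /data.
sample = io.StringIO("id,text\n1,first document\n2,second document\n")
texts = read_text_column(sample, "text")
print(texts)
```

Checking the column name up front gives a clear error message when the name typed at the prompt does not match the file header.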
PDF:
+ In some cases a PDF file cannot be read because access to its content is not allowed.
+ Some PDF files are not well encoded, so the extracted text may be garbled or incomplete.
You can add your own training model to the pipeline or change the cleaning parameters, such as the n-gram size.
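To make the n-gram size parameter concrete, here is a minimal sketch of word n-gram extraction in plain Python. The function word_ngrams and its lowercase, whitespace-based tokenization are assumptions for illustration; the cleaning step in this project may tokenize differently.

```python
def word_ngrams(text, n=2):
    """Split text on whitespace and return all consecutive word n-grams.

    Hypothetical helper for illustration, not the project's own cleaning code.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()
    # A text with fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("The quick brown fox", n=2)
trigrams = word_ngrams("The quick brown fox", n=3)
print(bigrams)
print(trigrams)
```

A larger n captures longer phrases but produces sparser features, so the best value usually depends on the amount of data.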