Improve tag categorization
Improve tag categorization by correcting the tags it suggests, and maintain its current performance by understanding its limits.
Accuracy of the model
- When given many sentences to tag, human experts do not agree with each other 100% of the time. When multiple human experts were asked to tag the same document, they agreed with each other only 85% of the time. This 85% agreement rate is the practical upper limit on how well the machine learning model can be expected to perform.
- Currently, the model is at 74% accuracy compared to human experts.
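The comparison above can be illustrated with a small calculation. The following is a minimal sketch, assuming that agreement is counted word by word and averaged over every pair of taggers; the source does not state exactly how its 85% and 74% figures were computed, and the label lists below are hypothetical data, not results.

```python
from itertools import combinations


def agreement_rate(tags_a, tags_b):
    """Fraction of words on which two taggers assigned the same label."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both taggers must label the same words")
    if not tags_a:
        return 0.0
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def mean_pairwise_agreement(annotations):
    """Average agreement over every pair of taggers."""
    pairs = list(combinations(annotations, 2))
    return sum(agreement_rate(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels for one five-word document ("TAG" = tagged, "O" = not tagged).
expert_1 = ["TAG", "O", "O", "TAG", "O"]
expert_2 = ["TAG", "O", "TAG", "TAG", "O"]
expert_3 = ["TAG", "O", "O", "O", "O"]
model = ["TAG", "TAG", "O", "TAG", "O"]

experts = [expert_1, expert_2, expert_3]

# How often the experts agree with each other (the ceiling for the model).
human_ceiling = mean_pairwise_agreement(experts)

# How often the model agrees with the experts, averaged over experts.
model_score = sum(agreement_rate(model, e) for e in experts) / len(experts)

print(f"human agreement: {human_ceiling:.0%}, model agreement: {model_score:.0%}")
```

Under this way of counting, a model score close to the human agreement rate would mean the model disagrees with experts about as often as experts disagree with each other.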
How to improve tag categorization
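Tag categorization improves when human experts correct its output. Corrections are collected in two cases:
- Tag categorization suggests a word as a tag and the human expert decides the suggestion is incorrect.
- Tag categorization misses a word and the human expert decides the word needs to be tagged.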
Once a large number of corrections has been collected across various industries, the data can be fed into the model for retraining and used to diagnose why the model disagreed with human judgment.
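The two kinds of correction can be derived by comparing the words the model suggested with the words the expert accepted. The sketch below is illustrative only: the `Correction` record, the `collect_corrections` helper, and the `kind` labels are hypothetical names, not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    document_id: str
    word: str
    kind: str  # "wrong_suggestion" or "missed_tag" (hypothetical labels)


def collect_corrections(document_id, suggested, expert):
    """Compare suggested tags with expert tags and return both kinds of correction."""
    suggested, expert = set(suggested), set(expert)
    # Words the model tagged but the expert rejected.
    wrong = [
        Correction(document_id, word, "wrong_suggestion")
        for word in sorted(suggested - expert)
    ]
    # Words the model skipped but the expert tagged.
    missed = [
        Correction(document_id, word, "missed_tag")
        for word in sorted(expert - suggested)
    ]
    return wrong + missed


# Hypothetical example: the model tagged "the" by mistake and missed "refund".
corrections = collect_corrections(
    "doc-1",
    suggested={"invoice", "late", "the"},
    expert={"invoice", "late", "refund"},
)

for correction in corrections:
    print(correction)
```

Storing each correction with its document and its kind keeps both uses open: the records can serve as retraining examples, and they can be grouped by kind to see whether the model mostly over-tags or under-tags.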
Limits to tag categorization
Machine learning models are highly dependent on the data they are trained on. A model is trained to perform a particular task by being shown numerous examples, and it can only perform well if new incoming data is similar to the training data set. If the input data changes, performance can be affected. Examples of such changes include:
- Providing the model with documents that differ significantly in length, that is, with a different median word count. For example, the majority of documents are 4-5 words long instead of 11 words.
- Using uncommon words instead of common words.
If the incoming data changes in the future, the model would need to be retrained or restructured to learn from a different or new data set with characteristics similar to the incoming data.
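Both kinds of change described above can be monitored with simple statistics. The following is a rough sketch, assuming documents are plain strings split on whitespace; the function names and sample documents are hypothetical, and "uncommon words" is approximated here as words that never appeared in the training data.

```python
from statistics import median


def word_counts(documents):
    """Number of whitespace-separated words in each document."""
    return [len(doc.split()) for doc in documents]


def median_length_shift(training_docs, incoming_docs):
    """Difference between incoming and training median word counts."""
    return median(word_counts(incoming_docs)) - median(word_counts(training_docs))


def unseen_word_rate(training_docs, incoming_docs):
    """Share of incoming words that never appeared in the training data."""
    vocabulary = {word.lower() for doc in training_docs for word in doc.split()}
    incoming_words = [word.lower() for doc in incoming_docs for word in doc.split()]
    if not incoming_words:
        return 0.0
    unseen = sum(word not in vocabulary for word in incoming_words)
    return unseen / len(incoming_words)


# Hypothetical training and incoming documents.
training_docs = [
    "the order arrived two days late",
    "support was quick and very helpful",
    "the box was damaged",
]
incoming_docs = [
    "late order",
    "damaged box",
    "refund pending approval",
]

shift = median_length_shift(training_docs, incoming_docs)
unseen = unseen_word_rate(training_docs, incoming_docs)

print(f"median length shift: {shift} words, unseen words: {unseen:.0%}")
```

A large shift in median length or a high share of unseen words would suggest that the incoming data no longer resembles the training data. What counts as "large" is not specified in the source and would need to be chosen for the data at hand.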