Improve tag categorization

Improve tag categorization by correcting tags, and maintain its current performance by understanding its limits.

Accuracy of the model

Similar to how human performance is measured through correct answers on a test, the model is measured with metrics such as accuracy. Once the model is trained, it is given sentences to tag. The model's tags are evaluated against the correct tags pre-determined by human experts, and the percentage of sentences the model tags correctly is its accuracy (a sketch of this calculation follows the notes below).
Note:
  • When given many sentences to tag, human experts do not agree with each other 100% of the time. This was measured by asking multiple human experts to tag the same document; they agreed with each other only 85% of the time. This 85% agreement is the practical upper limit on how well the machine learning model can perform.
  • Currently, the model is at 74% accuracy compared to human experts.
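
To make the comparison concrete, here is a minimal Python sketch of the accuracy calculation, assuming each sentence's tags are compared as a set against the experts' tags; the function and sample data are illustrative, not the product's actual API. Running the same comparison between two experts is how the 85% agreement ceiling above would be measured.

    def tag_accuracy(model_tags, expert_tags):
        # Fraction of sentences whose tag set exactly matches the experts' tags.
        matches = sum(m == e for m, e in zip(model_tags, expert_tags))
        return matches / len(expert_tags)

    # Example: the model matches the experts on 3 of 4 sentences -> 75% accuracy.
    model  = [{"invoice"}, {"refund"}, {"invoice", "late"}, {"shipping"}]
    expert = [{"invoice"}, {"refund"}, {"invoice"}, {"shipping"}]
    print(tag_accuracy(model, expert))  # 0.75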

How to improve tag categorization

Tag categorization can be improved by collecting data on how human experts correct tags and feeding those corrections back into the model. Corrections are collected in one of two ways (see the sketch after this list):
  • When tag categorization suggests a word as a tag and the human expert decides it's incorrect.
  • When tag categorization misses a word as a tag and the human expert decides it needs to be tagged.
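
As a minimal sketch of what a collected correction might look like, the Python record below captures both cases; the field names, values, and schema are assumptions for illustration, not the actual data model.

    from dataclasses import dataclass

    @dataclass
    class TagCorrection:
        # One expert correction to a model-suggested tag (illustrative schema).
        document_id: str
        word: str
        kind: str      # "rejected": the model suggested the tag and the expert removed it
                       # "added": the model missed the tag and the expert added it
        industry: str  # recorded so coverage across industries can be checked later

    corrections = [
        TagCorrection("doc-17", "premium", kind="rejected", industry="insurance"),
        TagCorrection("doc-42", "chargeback", kind="added", industry="banking"),
    ]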

Once a large number of corrections has been collected across various industries, the data can be fed into the model for retraining and used to diagnose why the model did not agree with human judgement.
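
Continuing the TagCorrection sketch above, the example below shows one way the corrections could be applied to the model's suggested tags to produce corrected labels for retraining, and tallied to see where disagreement is concentrated; the names and sample data are illustrative assumptions.

    from collections import Counter

    def build_retraining_labels(suggested, corrections):
        # Apply expert corrections to the model's suggested tags
        # to produce corrected labels for retraining.
        gold = {doc: set(tags) for doc, tags in suggested.items()}
        for c in corrections:
            if c.kind == "rejected":
                gold[c.document_id].discard(c.word)
            else:  # "added"
                gold[c.document_id].add(c.word)
        return gold

    suggested = {"doc-17": {"premium", "policy"}, "doc-42": {"refund"}}
    print(build_retraining_labels(suggested, corrections))
    # {'doc-17': {'policy'}, 'doc-42': {'refund', 'chargeback'}}

    # Diagnosis: which correction type dominates in which industry?
    print(Counter((c.industry, c.kind) for c in corrections))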

Limits to tag categorization

Machine learning models are highly dependent on the data they are trained on. A model is trained to perform a particular task by being shown numerous examples, and it can only perform well if new incoming data is similar to the training data set. If the incoming data changes, performance can degrade.

The following differences in incoming data can affect the performance of the model (a sketch for detecting them follows this list):
  • Providing the model with documents significantly different in length. For example, the majority of incoming documents are 4-5 words long rather than around 11 words, as in the training data.
  • Using uncommon words instead of common words.
  • Providing the model with documents with a different median word count.
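
As a minimal illustration of how such differences could be detected before they hurt performance, the Python sketch below compares incoming documents against the training set on word count and uncommon-word usage; the function, fields, and sample data are assumptions, not part of the product.

    from statistics import median

    def data_drift_report(training_docs, incoming_docs, vocabulary):
        # Compare incoming documents against the training set on the
        # characteristics listed above: word counts and unknown words.
        trn_lengths = [len(d.split()) for d in training_docs]
        inc_lengths = [len(d.split()) for d in incoming_docs]
        inc_words = [w.lower() for d in incoming_docs for w in d.split()]
        unknown_rate = sum(w not in vocabulary for w in inc_words) / max(len(inc_words), 1)
        return {
            "training_median_words": median(trn_lengths),
            "incoming_median_words": median(inc_lengths),
            "unknown_word_rate": round(unknown_rate, 2),  # proxy for uncommon words
        }

    report = data_drift_report(
        training_docs=["update the billing address on the account for this tenant"],
        incoming_docs=["fix invoice", "refund the order now"],
        vocabulary={"update", "the", "billing", "address", "on", "account",
                    "for", "this", "tenant", "invoice", "refund"},
    )
    print(report)  # flags shorter documents and a higher unknown-word rate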

If the incoming data changes in the future, the model would need to be retrained or restructured to learn from a different or new data set with characteristics similar to the incoming data.