Monday, August 1, 2016

Talking about text classifiers

Over the last year I have been reading about machine learning and trying things like spam detection in my personal projects. Along the way I looked at KNN, random decision forests and naive Bayes.


As a result, I wrote a C++ library to classify texts, along with slides for a presentation that you can view at the end of this blog post.

I chose Naive Bayes because it is one of the simplest classifiers: it is based on Bayes' theorem together with a naive assumption of complete independence between features. It is one of the most basic text classification techniques, with applications such as email spam detection, document categorization, detection of sexually explicit content, personal email sorting, language identification and sentiment analysis (a common NLP task). Despite its naive design and oversimplified assumptions, Naive Bayes performs well on many complex real-world problems. Another advantage is that it works fine with limited CPU and memory resources.
To improve detection accuracy, I use a DFA (deterministic finite automaton) to match patterns and add each match to a ranking, where each ranking entry corresponds to one classification. You can view the code here. To build your own automaton, you can alternatively use tools such as Flex or Bison.


If you look at slide number 12 of the presentation, you can see my point of view on how the ranking improves the accuracy of the classifier in the results.

So, this is a very cool trick to gain accuracy. No more words, friends. Thank you for reading this!

Cheers!

