Monday, August 1, 2016

Talking about text classifiers

Over the last year I have been reading about machine learning and trying things like spam detection in my personal projects. Along the way I looked at KNN, random decision forests and naive Bayes.


As a result, I wrote a C++ library to classify texts, along with slides for a presentation that you can view at the end of this blog post.

I chose Naive Bayes because it is one of the simplest classifiers: it is based on Bayes' theorem together with a naive assumption of complete independence between features. It is one of the most basic text classification techniques, with applications such as email spam detection, document categorization, detection of sexually explicit content, personal email sorting, language identification and sentiment analysis (a common NLP task). Despite its naive design and oversimplified assumptions, Naive Bayes performs well on many complex real-world problems. Another advantage is that it works fine with limited CPU and memory resources.
To improve detection accuracy, I use a DFA (deterministic finite automaton) to match patterns and add each match to a ranking, where each ranking entry corresponds to one classification. You can view the code here. To build your own automaton, you can alternatively use tools such as Flex or Bison.


If you look at slide number 12 of the presentation, you can see my point of view on how the ranking improves the accuracy of the classifier in the results.

So, this is a very cool trick to gain accuracy. No more words, friends. Thank you for reading this!

Cheers!

