A Shiny Word Predictor!

2014 Data Science Capstone - Tweet me, honey, honey, Tweet me!

Sepanda Pouryahya.

Free to use and runs right in the browser, using HTML5, Shiny and R.

Word Predictions	User Options!	Twitter mode/tweet vocab

The app uses two different data sets: one for general-purpose predictions and another for Twitter predictions.
General predictions are made by splitting the text into tokens using a fancy bit of Google-inspired regex!

theGpattern <- "[A-GJXa-gjx]#|([£\\$])?[0-9.]+|[a-zA-Z&_0-9]+('[stdm]|('ve)|('ll)|('re))?\\+*|[-\\.@!%\\*\\(\\)=\\{\\}\\|:;,><\\?/~`'#+£\\$\"]"
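
As a rough sketch of how a pattern like this can be applied, base R's gregexpr() and regmatches() return every match in order, which turns a sentence into a vector of tokens. The helper name tokenize_general and the sample sentence are assumptions made for illustration; the app's own extraction code may well differ.

```r
# Pattern copied from above.
theGpattern <- "[A-GJXa-gjx]#|([£\\$])?[0-9.]+|[a-zA-Z&_0-9]+('[stdm]|('ve)|('ll)|('re))?\\+*|[-\\.@!%\\*\\(\\)=\\{\\}\\|:;,><\\?/~`'#+£\\$\"]"

# Hypothetical helper: return every substring of `text` that matches the pattern.
tokenize_general <- function(text, pattern = theGpattern) {
  regmatches(text, gregexpr(pattern, text))[[1]]
}

# Contractions, prices and names such as C# or C++ should each stay in one piece.
tokens <- tokenize_general("I've got $5 for C# and C++!")
print(tokens)
```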

The user input is then processed using a "Stupid Backoff" or "Kneser-Ney" algorithm, which picks out the highest-scoring, most relevant predictions based on preprocessed data from the HC Corpora data sets.
Tweets get their own special word processing. Why? Punctuation in Tweets can often mean more than the words around it.
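
To give a feel for the backoff idea, here is a minimal sketch of Stupid Backoff scoring on a toy corpus. It assumes the usual formulation: use the relative frequency of the longest n-gram that was seen, and multiply by a fixed discount (0.4 is the commonly cited value) each time the context is shortened. The corpus, the function names and the discount are illustrative assumptions, not the app's actual code or settings.

```r
# Toy corpus standing in for the preprocessed HC Corpora n-gram tables.
corpus <- c("i love nlp", "i love r", "i love nlp a lot", "you love r")

# Count every n-gram of order n; returns a named integer vector.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(sentences, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

counts <- lapply(1:3, function(n) count_ngrams(corpus, n))  # counts[[n]] holds n-gram counts

# Look up a count, treating unseen n-grams as zero.
lookup <- function(tab, key) if (key %in% names(tab)) tab[[key]] else 0L

# Stupid Backoff score of `candidate` following `context` (a character vector).
stupid_backoff <- function(context, candidate, counts, alpha = 0.4) {
  context <- tail(context, length(counts) - 1)  # keep only what the tables can use
  penalty <- 1
  repeat {
    k <- length(context)
    if (k == 0) {
      # No context left: fall back to the unigram relative frequency.
      return(penalty * lookup(counts[[1]], candidate) / sum(counts[[1]]))
    }
    seen <- lookup(counts[[k + 1]], paste(c(context, candidate), collapse = " "))
    if (seen > 0) {
      return(penalty * seen / lookup(counts[[k]], paste(context, collapse = " ")))
    }
    context <- context[-1]      # back off to a shorter context
    penalty <- penalty * alpha  # and discount the score
  }
}

# Rank the whole vocabulary and keep the k best-scoring words.
predict_next <- function(context, counts, k = 3) {
  vocab <- names(counts[[1]])
  scores <- vapply(vocab, function(w) stupid_backoff(context, w, counts), numeric(1))
  head(names(sort(scores, decreasing = TRUE)), k)
}

print(predict_next(c("i", "love"), counts, k = 2))
```

The scores are not probabilities (they do not sum to one), which is what makes the method "stupid" and also cheap: no smoothing parameters have to be estimated from the data.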

TweetText <- "NLP makes me :) I <3 it!"

Solution? Treat groups of punctuation, letters and numbers as words in their own right!
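
One way this could look in practice is sketched below. The pattern tweetPattern is a hypothetical stand-in, not the app's actual Twitter regex: it matches the heart emoticon explicitly, then any run of punctuation, then any run of letters, digits and apostrophes.

```r
TweetText <- "NLP makes me :) I <3 it!"

# Hypothetical pattern for illustration: "<3", or a run of punctuation,
# or a run of letters, digits and apostrophes.
tweetPattern <- "<3|[[:punct:]]+|[[:alnum:]']+"

# Collect every match in order, so emoticons survive as single tokens.
tweetTokens <- regmatches(TweetText, gregexpr(tweetPattern, TweetText))[[1]]
print(tweetTokens)
```

With this pattern the smiley and the heart each come out as a single "word", while the trailing exclamation mark is split off from "it".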

The models were trained on 70% of the entire data set. The results that follow are based on random samples of the remaining 30% of the data.
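
A split like this can be sketched in a few lines of base R. The vector corpus_lines and the seed are placeholders chosen for the example; the real split was made on the HC Corpora text and its exact procedure is not shown here.

```r
set.seed(2014)  # arbitrary seed so the split is reproducible

# Stand-in for the corpus: one element per line of text.
corpus_lines <- paste("line", 1:100)

# Randomly pick 70% of the line indices for training.
n <- length(corpus_lines)
train_idx <- sample(n, size = round(0.7 * n))

train <- corpus_lines[train_idx]
test  <- corpus_lines[-train_idx]  # the remaining 30% is held out

cat(length(train), "training lines,", length(test), "held-out lines\n")
```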

The fitness test against random samples of the held-out test data yielded the following (averaged) values.

Test data	Algorithm	Number of predicted words	Success rate
Twitter	Backoff	1	14.7 %
Twitter	Backoff	2	19.2 %
Twitter	Backoff	3	23.2 %
Blog/News	Backoff	1	18.8 %
Blog/News	Backoff	2	28.3 %
Blog/News	Backoff	3	34.1 %
Blog/News	Kneser-Ney	1	15 %
Blog/News	Kneser-Ney	2	21.8 %
Blog/News	Kneser-Ney	3	27.5 %
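
One plausible reading of these figures, sketched below, is that a test case counts as a success when the word that actually came next appears among the top k predicted words. That definition, the helper success_rate and the toy data are assumptions made for illustration; they do not reproduce the numbers in the table.

```r
# Hypothetical evaluation helper: `predictions` is a list of ranked guesses
# (best first), `actual` holds the word that really came next in each test case.
success_rate <- function(predictions, actual, k) {
  hits <- mapply(function(p, a) a %in% head(p, k), predictions, actual)
  mean(hits)
}

# Toy data for illustration only; these are not the app's results.
predictions <- list(c("the", "a", "my"), c("you", "it", "me"), c("is", "was", "of"))
actual <- c("a", "you", "day")

for (k in 1:3) {
  cat(sprintf("top %d: %.1f %%\n", k, 100 * success_rate(predictions, actual, k)))
}
```

Under this definition the rate can only stay the same or rise as more predicted words are allowed, which matches the trend in each block of the table.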

Instructions for using the app can be found within the app itself!

It's fun, it's free!

More about me: