A Shiny Word Predictor!

2014 Data Science Capstone - Tweet me, honey, honey, Tweet me!

Sepanda Pouryahya.

The App

Free to use and runs right in the browser, using Html5, Shiny and R.

  • Just type
  • Point
  • Click!
Word Predictions User Options! Twitter mode/tweet vocab

Grab it here: https://sepanda.shinyapps.io/ShinyWordPredictor/

How it works

  • The app uses two different data sets. One for general purpose predictions and another for Twitter predictions.
  • General predictions are made by extracting info using a fancy bit of Google inspired regex!
theGpattern <- "[A-GJXa-gjx]#|([£\\$])?[0-9.]+|[a-zA-Z&_0-9]+('[stdm]|('ve)|('ll)|('re))?\\+*|[-\\.@!%\\*\\(\\)=\\{\\}\\|:;,><\\?/~`'#+£\\$\"]"
  • The user input is then processed, using a "Stupid backoff" or "Knesser-Ney" algorithm, picking out the highest and most relevant predictions based on preprocessed data from HC Corpora data sets.

  • Tweets get their own special word processing. Why? Punctuation in Tweets can often mean more than the words around them.

TweetText <- "NLP makes me :) I <3 it!" 

Solution? Treat groups of punctuation, letters and numbers as words in their own right!

How well does it do?

The training scheme involved 70% of the entire data set, the following is based on random sampling of the remaining 30% of the data.

The fitness test against random samples of the held out test data yielded the following (averaged) values.

Testing Alogrithm Number of predicted words success rate
Twitter Backoff 1 14.7 %
Backoff 2 19.2 %
Backoff 3 23.2 %
Blog/News Backoff 1 18.8 %
Backoff 2 28.3 %
Backoff 3 34.1 %
Blog/News Kneser-Ney 1 15 %
Kneser-Ney 2 21.8 %
Kneser-Ney 3 27.5 %

Thanks for your time.!