VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis tool first described in a 2014 paper by Hutto & Gilbert. Unlike its predecessors, VADER is tuned for social media text, including emoji and the short-hands common in microblogs. I wanted to experiment with sentiment-based signals for Bitcoin, and it seemed worth throwing VADER at a particularly noisy body of text: Reddit, specifically the daily discussion thread on the BitcoinMarkets subreddit.
To do this I needed a mechanism for parsing Reddit comments and an implementation of VADER. JRAW, the Java Reddit API Wrapper, provided the former, and Animesh Pandey's VaderSentimentJava package offered the latter. Wiring it all together also required a Reddit API key, and I used Bladerunner to create a simple, configurable application runner.
Creating a Reddit app
First sign up for a Reddit account, then go to Preferences > Apps and create a "script" type application. Note that "Redirect URI" requires a value even though it is not used: http://localhost:8080 will suffice. Make note of the client ID and secret values -- you'll need them in the next step.
Programmatically reading a subreddit
The starting point with JRAW is to instantiate a RedditClient. We will create a "userless" set of credentials for this application rather than logging in as ourselves, leveraging the client ID and secret created above:
String clientId = "...";
String clientSecret = "...";
String userId = "...";
UUID appId = UUID.randomUUID();
Credentials oauthCreds = Credentials.userless(clientId, clientSecret, appId);
UserAgent userAgent = new UserAgent("Serenity", "reddit-signals", "0.1.0", userId);
// Authenticate our client
OkHttpNetworkAdapter adapter = new OkHttpNetworkAdapter(userAgent);
RedditClient reddit = OAuthHelper.automatic(adapter, oauthCreds);
The next step is to get every instance of the daily discussion thread. With JRAW's fluent query API this is easy: we just specify the subreddit we want and search across all time for the thread title:
SearchPaginator search = reddit.subreddit("BitcoinMarkets")
    .search()
    .timePeriod(TimePeriod.ALL)
    .query("[Daily Discussion]")
    .build();
The result is a streamable set of Submission objects, from which we can extract the comment tree. Another nice feature of JRAW is that the tree-walking logic is already built for you; even better, it works with parallel streams, so everything is evaluated on the common fork-join pool. Note that to get at the comments we need to convert our lightweight Submission object to a SubmissionReference and re-query the Reddit API mid-stream:
search.forEach(listing -> listing.getChildren().parallelStream()
    .forEach(submission -> submission.toReference(reddit).comments()
        .walkTree().iterator().forEachRemaining(comment -> {
            // extract the raw comment text for sentiment analysis
            String txt = comment.getSubject().getBody();
        })));
VADER analysis
Although it's the most complex piece under the hood, the sentiment analysis code is the smallest part of the program, thanks to the VaderSentimentJava library. Note that getPolarity() returns a Map with four values: positive, negative, neutral and compound. The last of these is what we want: the net "direction" of the sentiment, positive or negative. The compound value falls in the range (-1, 1).
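For intuition on why compound is bounded: the VADER paper normalizes the summed valence score x of a piece of text as x / sqrt(x^2 + alpha), with alpha = 15, which maps any raw sum into the open interval (-1, 1). A minimal sketch of that formula (the class and method names here are mine, not the library's):

```java
public class CompoundNormalizer {
    // VADER's normalization constant (alpha = 15 in the reference implementation)
    private static final double ALPHA = 15.0;

    // Maps an unbounded summed valence score into the open interval (-1, 1)
    public static double normalize(double rawScore) {
        return rawScore / Math.sqrt(rawScore * rawScore + ALPHA);
    }

    public static void main(String[] args) {
        System.out.println(normalize(0.0));   // neutral text stays at 0.0
        System.out.println(normalize(4.0));   // strongly positive, approaches but never reaches 1
        System.out.println(normalize(-4.0));  // symmetric for negative sentiment
    }
}
```

This is why even an extremely positive wall of text never quite reaches a compound score of 1.0.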
Given a ThreadLocal to hold the analyzer (to allow reuse):
private ThreadLocal<SentimentAnalyzer> analyzer = ThreadLocal.withInitial(SentimentAnalyzer::new);
then you can analyze a Reddit comment with:
String txt = comment.getSubject().getBody();
SentimentAnalyzer sentimentAnalyzer = analyzer.get();
sentimentAnalyzer.setInputString(txt);
try {
    sentimentAnalyzer.setInputStringProperties();
} catch (IOException e) {
    throw new RuntimeException(e);
}
sentimentAnalyzer.analyze();
Map<String,Float> polarityMap = sentimentAnalyzer.getPolarity();
Float polarity = polarityMap.get("compound");
Taken together with the comment's timestamp (accessed via comment.getSubject().getCreated()) we now have a timeseries of sentiment for comments made on the BitcoinMarkets daily trader discussion, spanning several years of discussions -- over 96K comments.
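The data collection code below consumes SentimentScore value objects. The original class isn't reproduced in this post; a hypothetical minimal sketch, with field names assumed to match the sentiment_scores table and the getters used by the processor, might look like:

```java
// Hypothetical value object for one scored comment; the field names here are
// assumptions matching the database schema and the processor's getters.
public class SentimentScore {
    private final String id;        // Reddit comment ID
    private final float polarity;   // VADER compound score, in (-1, 1)
    private final int weight;       // net up/down votes on the comment
    private final long timestamp;   // comment creation time, epoch millis

    public SentimentScore(String id, float polarity, int weight, long timestamp) {
        this.id = id;
        this.polarity = polarity;
        this.weight = weight;
        this.timestamp = timestamp;
    }

    public String getId() { return id; }
    public float getPolarity() { return polarity; }
    public int getWeight() { return weight; }
    public long getTimestamp() { return timestamp; }
}
```

The actual class in the repository may differ; the point is simply an immutable record of (id, polarity, weight, timestamp) per comment.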
Data collection
Finally, to put this all together, we need to capture the timeseries data. Following the approach from the Lightweight Data Layers post, we can create a very simple SQLite database:
CREATE TABLE sentiment_scores (
    id VARCHAR(32) PRIMARY KEY NOT NULL,
    polarity REAL NOT NULL,
    weight INTEGER NOT NULL,
    created_ts DATETIME NOT NULL
);
and use JDBI to insert sentiment score objects -- with a simple select-all query at start-up to avoid inserting duplicate comment identifiers:
public class JdbiSentimentScoreProcessor implements SentimentScoreProcessor {
    private static final String ST_GROUP = "serenity/signal/sentiment/JdbiSentimentScoreProcessor.sql.stg";

    private final Jdbi dbi;
    private final ST queryTemplate;
    private final ST insertTemplate;
    private final Set<String> knownIds = new HashSet<>();

    public JdbiSentimentScoreProcessor(Jdbi dbi) {
        this.dbi = dbi;
        this.queryTemplate = StringTemplateSqlLocator.findStringTemplate(ST_GROUP, "selectAll");
        this.insertTemplate = StringTemplateSqlLocator.findStringTemplate(ST_GROUP, "insert");

        // pre-load all previously stored comment IDs so duplicates can be skipped
        dbi.useHandle(handle -> {
            String sql = queryTemplate.render();
            handle.createQuery(sql).map((rs, ctx) -> rs.getString("id")).forEach(knownIds::add);
        });
    }

    @Override
    public void accept(SentimentScore sentimentScore) {
        if (knownIds.contains(sentimentScore.getId())) {
            return;
        }
        dbi.useHandle(handle -> {
            String sql = insertTemplate.render();
            handle.createUpdate(sql)
                .bind("id", sentimentScore.getId())
                .bind("polarity", sentimentScore.getPolarity())
                .bind("weight", sentimentScore.getWeight())
                .bind("createdTs", sentimentScore.getTimestamp())
                .execute();
        });
    }
}
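The SQL statements themselves live in the StringTemplate group file referenced by ST_GROUP. That file isn't shown in the post; given the table definition and the named parameters bound above, a plausible sketch of its two templates (the exact contents are my assumption; the selectAll query only needs the id column, since that is all the row mapper reads) would be:

```
selectAll() ::= <<
SELECT id FROM sentiment_scores
>>

insert() ::= <<
INSERT INTO sentiment_scores (id, polarity, weight, created_ts)
VALUES (:id, :polarity, :weight, :createdTs)
>>
```

Keeping the SQL in an external template file rather than in string literals is what the StringTemplateSqlLocator lookup by template name enables.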
This is useful because it gives us a simple, file-based table which can be read into Jupyter notebooks, R, or other data analysis tools for further analysis -- to see whether our basic sentiment analyzer reveals anything about Bitcoin price trends. For a simple example:
import pandas as pd
import sqlite3
conn = sqlite3.connect("/Users/kdowney/.serenity/sentiment.db")
df = pd.read_sql_query("select * from sentiment_scores;", conn)
df['created_ts'] = pd.to_datetime(df['created_ts'], unit='ms')
df['polarity'].mean()
shows that the overall sentiment of the comments is positive -- not surprising -- at about 0.1. What is a bit more interesting is the average weighted by up/down votes:
(df['polarity']*df['weight']).sum() / df['weight'].sum()
This comes out to 0.08, slightly closer to neutral: the wisdom of the crowd seems to be that the average comment is too bullish.
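To make the arithmetic behind that pandas one-liner explicit, the vote-weighted average is sum(polarity_i * weight_i) / sum(weight_i), so heavily upvoted comments count for more. A toy illustration (data values are made up for the example):

```java
public class WeightedMean {
    // Vote-weighted average polarity: sum(polarity * weight) / sum(weight)
    public static double weightedMean(double[] polarity, int[] weight) {
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < polarity.length; i++) {
            num += polarity[i] * weight[i];
            den += weight[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Toy data: one bullish comment with 1 upvote, one neutral comment with 3;
        // the heavily upvoted neutral comment pulls the average down
        double[] p = {0.8, 0.0};
        int[] w = {1, 3};
        System.out.println(weightedMean(p, w)); // prints 0.2, versus an unweighted mean of 0.4
    }
}
```

The same effect on the real data drags 0.1 down to 0.08: voters discount the most bullish comments.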
The code
See https://github.com/cloudwall/reddit-sentiment for the complete code.