Christopher J. MacLellan

Machine Learning with Mixed Numeric and Nominal Data using COBWEB/3

2016-05-13T00:00:00+00:00

Demonstration of two extensions to the COBWEB/3 algorithm that enable it to better handle features with continuous values.

Problem Overview

Many machine learning applications require the use of both nominal (e.g., color=“green”) and numeric (e.g., red=0, green=255, and blue=0) feature data, but most algorithms only support one or the other. This requires researchers and data scientists to use a variety of ad hoc approaches to convert troublesome features into the format supported by their algorithm of choice. For example, when using regression-based approaches, such as linear or logistic regression, all features need to be converted into continuous features. In these cases, it is typical to encode each nominal attribute-value pair as its own 0/1 feature (a one-hot or one-of-K encoding). When using algorithms that support nominal features, such as Naive Bayes or Decision Trees, numerical features often need to be converted. In these cases, the numerical features might be converted into discrete classes (e.g., temperature = “high”, “medium”, or “low”). When converting features from one type to another, their meaning is implicitly changed. Nominal features converted into numeric are treated by the algorithm as if values between 0 and 1 were possible. Also, some algorithms assume that numeric features are normally distributed, whereas nominal features converted into numeric will follow a binomial, rather than normal, distribution. Conversely, numeric features converted into nominal features lose some of their information.
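To make the two conversions concrete, here is a minimal sketch in plain Python. The function names, category list, and thresholds are hypothetical and chosen only for illustration; real projects would more likely use a library encoder.

```python
def one_hot(value, categories):
    """Encode a nominal value as a list of 0/1 indicators, one per category."""
    return [1 if value == category else 0 for category in categories]


def discretize(value, thresholds, labels):
    """Map a numeric value to a nominal label.

    `thresholds` must be ascending and `labels` must have one more entry
    than `thresholds`; the last label covers everything above the top threshold.
    """
    for threshold, label in zip(thresholds, labels):
        if value < threshold:
            return label
    return labels[-1]


colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]
print(discretize(72.0, [50.0, 80.0], ["low", "medium", "high"]))  # medium
```

Note how the discretized temperature no longer distinguishes 72.0 from 79.9, which is the information loss described above.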

Luckily, there exist a number of algorithms that support mixed feature types and attempt to balance the different types of features appropriately (usually these are extensions of algorithms that support nominal features). For example, Decision Tree algorithms are often modified to automatically support median split features as a way of handling numeric attributes. Also, Naive Bayes has been modified to use a Gaussian distribution for modeling numeric features. Today I wanted to talk about another algorithm that supports both numeric and nominal features, COBWEB/3.

COBWEB Family of Algorithms

Over the past year Erik Harpstead and I have been developing Python implementations of some of the machine learning algorithms in the COBWEB family (our code is freely available on GitHub). At their core, these algorithms are incremental divisive hierarchical clustering algorithms that can be used for supervised, semi-supervised, and unsupervised learning tasks. Given a sequence of training examples, COBWEB constructs a hierarchical organization of the examples. Here is a simple example of a COBWEB tree (image from the Wikipedia article on conceptual clustering):

Each node in the hierarchy maintains a probability table of how likely each attribute value is given the concept/node. To construct this tree hierarchy, COBWEB sorts each example into its tree and at each node it considers four operations to incorporate the new example into its tree:

To determine which operation to perform, COBWEB simulates each operation, evaluates the result, and then executes the best operation. For evaluation, COBWEB uses Category Utility:

\[CU(\{C_1, C_2, \ldots, C_n\}) = \frac{1}{n} \sum_{k=1}^n P(C_k) \left[ \sum_i \sum_j P(A_i = V_{ij} | C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right]\]

This measure corresponds to the increase in the expected number of attribute values that can be correctly guessed in the children nodes over the parent. It is similar to the information gain metric used by decision trees, but optimizes for the prediction of all attributes rather than a single attribute.
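As a rough illustration of the formula, here is a minimal sketch that computes category utility for nominal attributes only. It assumes each instance is a dict mapping attribute names to values and that every cluster is non-empty; the function names are hypothetical and this is not the code from our library.

```python
from collections import Counter


def expected_correct_guesses(instances):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within `instances`."""
    n = len(instances)
    counts = Counter()
    for instance in instances:
        for attribute, value in instance.items():
            counts[(attribute, value)] += 1
    return sum((count / n) ** 2 for count in counts.values())


def category_utility(clusters):
    """Category utility of a partition, given as a list of lists of instances."""
    everything = [instance for cluster in clusters for instance in cluster]
    total = len(everything)
    parent = expected_correct_guesses(everything)
    gain = sum(
        (len(cluster) / total) * (expected_correct_guesses(cluster) - parent)
        for cluster in clusters
    )
    return gain / len(clusters)


green, blue = {"color": "green"}, {"color": "blue"}
print(category_utility([[green, green], [blue, blue]]))  # 0.25
print(category_utility([[green, blue], [green, blue]]))  # 0.0
```

The clean split scores higher because knowing the cluster lets you guess the color correctly every time, while the mixed split gives no improvement over the parent.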

Once COBWEB has constructed a tree, it can be used to return clusterings of the examples at varying levels of aggregation. Further, COBWEB can perform prediction by categorizing partially specified training examples (e.g., missing the attributes to be predicted) into its tree and using its tree nodes to make predictions about the missing attributes. Thus, COBWEB’s approach to prediction shares many similarities with Decision Tree learning, which also categorizes new data into its tree and uses its nodes to make predictions. However, COBWEB can be used to predict any missing attribute (or even multiple missing attributes), whereas each Decision Tree can be used to predict a single attribute.

COBWEB’s approach is also similar to the k-nearest neighbors algorithm; i.e., it finds the most similar previous training data and uses this to make predictions about new instances. However, COBWEB uses its tree structure to make the prediction process more efficient; i.e., it uses its tree structure to guide the search for nearest neighbors and compares each new instance with $O(\log n)$ neighbors rather than $O(n)$ neighbors. Earlier versions of COBWEB were most similar to nearest neighbors with $k=1$ because they always categorized to a leaf and used the leaf to make predictions, but later variants (which we implement in our code) use past performance to decide when it is better to use intermediate tree nodes to make predictions (higher nodes in the tree correspond to a larger number of neighbors being used). Thus, with less data, COBWEB might function similarly to nearest neighbors, but as it accumulates more data it dynamically adapts the number of neighbors it uses to make predictions based on past performance.

COBWEB/3

Now returning to our original problem, the COBWEB algorithm only operates with nominal attributes; e.g., a color attribute might be either “green” or “blue”, rather than a continuous value. The original algorithm was extended by COBWEB/3, which models the probability of each numeric attribute value using a Gaussian distribution (similar to how Naive Bayes handles numeric data). With this change, for continuous attributes, $P(A_i = V_{ij} | C_k)^2$ is replaced with $\frac{1}{2 \sqrt{\pi} \sigma_{ik}}$ and $P(A_i = V_{ij})^2$ is replaced with $\frac{1}{2 \sqrt{\pi} \sigma_{i}}$. I find COBWEB’s approach to handling mixed numeric and nominal attributes interesting because it treats each attribute as what it is (numeric or nominal), but combines them in a principled way using category utility. This approach is not without problems though. I’m going to talk about two problems with the COBWEB/3 approach and how I have overcome them in my COBWEB/3 implementation.

Problem 1: Small values of $\sigma$

This approach runs into problems when the $\sigma$ values get close to 0. In these situations $\frac{1}{\sigma} \rightarrow \infty$. To handle these situations, COBWEB/3 bounds $\sigma$ by $\sigma_{acuity}$, a user-defined value that specifies the smallest allowed value for $\sigma$. This ensures that $\frac{1}{\sigma}$ never becomes undefined. However, this does not take into account situations when $P(A_i) < 1.0$ (i.e., when an attribute is missing from some examples). Additionally, for small values of $\sigma_{acuity}$, this lets COBWEB achieve more than 1 expected correct guess per attribute, which is impossible for nominal attributes (and does not really make sense for continuous attributes either). This causes problems when both nominal and continuous values are being used together; i.e., continuous attributes will get higher preference because COBWEB tries to maximize the number of expected correct guesses.

To account for this, my implementation of COBWEB/3 uses the modified equation: $P(A_i = V_{ij})^2 = P(A_i)^2 \cdot \frac{1}{2 \sqrt{\pi} \sigma}$. The key change here is that we multiply by $P(A_i)^2$ for situations when attributes might be missing. Further, instead of bounding $\sigma$ by acuity, we add some independent, normally distributed noise to $\sigma$; i.e., $\sigma = \sqrt{\sigma^2 + \sigma_{acuity}^2}$, where $\sigma_{acuity} = \frac{1}{2 \sqrt{\pi}}$. This ensures the number of expected correct guesses never exceeds 1. From a theoretical point of view, it is basically an assumption that there is some independent, normally distributed measurement error that is added to the estimated error of the attribute. It is possible that there is additional measurement error, but the value is chosen so as to yield a sensible upper bound on the expected correct guesses.

To get a sense for how adding noise to $\sigma$ impacts its value I plotted the original and noisy values of $\sigma$ given different values of $\sigma$, so we can see how they differ:

The plot basically shows that for larger values (roughly $\sigma > 2$) there is less than 1% difference between the original and noisy $\sigma$ (the difference is about 4% at $\sigma = 1$), but for small values the difference increases as the original value approaches 0.

To get a sense of how this impacts the expected correct guesses, I plotted the expected number of correct guesses for the numeric attribute $(P(A_i = V_{ij})^2)$ for the three possible approaches: unbounded $\sigma$, acuity bounded $\sigma$, and noisy $\sigma$:

This graph shows that the unbounded version exceeds 1 correct guess as we get close to 0. This is bad when we have mixed numeric and nominal features because numeric features will be worth more than the nominal features. Next, we see that the acuity-bounded version levels off at 1 correct guess. This is also a problem because it makes it impossible for COBWEB to distinguish which values produce the best category utility for small values of $\sigma$. The noisy version produces the most reasonable results: it provides the ability to discriminate between different values of $\sigma$ for the entire range (even small values), it never exceeds 1 correct guess, and for medium and larger values of $\sigma$ it produces the same behavior as the other approaches.
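The three formulations can be compared numerically with a short sketch like the one below. The function names are hypothetical, $P(A_i)$ is taken to be 1, and for the bounded version I assume the acuity is set to the same $\frac{1}{2 \sqrt{\pi}}$ used by the noisy version (in COBWEB/3 proper it is user-defined).

```python
import math

# sigma_acuity = 1 / (2 * sqrt(pi)), the value used for the noisy formulation.
ACUITY = 1.0 / (2.0 * math.sqrt(math.pi))


def guesses_unbounded(sigma):
    """Expected correct guesses with no protection against small sigma."""
    return 1.0 / (2.0 * math.sqrt(math.pi) * sigma)


def guesses_bounded(sigma, acuity=ACUITY):
    """Expected correct guesses when sigma is clamped from below by the acuity."""
    return guesses_unbounded(max(sigma, acuity))


def noisy_sigma(sigma, acuity=ACUITY):
    """Sigma after adding independent Gaussian noise with std equal to the acuity."""
    return math.sqrt(sigma ** 2 + acuity ** 2)


def guesses_noisy(sigma, acuity=ACUITY):
    """Expected correct guesses using the noisy sigma."""
    return guesses_unbounded(noisy_sigma(sigma, acuity))


for sigma in (0.01, 0.1, 0.5, 1.0, 2.0):
    print(
        f"sigma={sigma:4.2f}  unbounded={guesses_unbounded(sigma):7.3f}  "
        f"bounded={guesses_bounded(sigma):5.3f}  noisy={guesses_noisy(sigma):5.3f}"
    )
```

Running this shows the same qualitative picture as the graph: the unbounded value blows up near 0, the bounded value is flat at 1 below the acuity, and the noisy value keeps changing with $\sigma$ while staying below 1.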

Problem 2: Sensitive to Scale of Variables

Additionally, COBWEB/3 is sensitive to the scale of numeric feature values. If feature values are large (e.g., 0-1000 vs 0-1), then the standard deviation of the values will be larger and the feature will have a lower number of expected correct guesses. To overcome this limitation, my implementation of COBWEB/3 performs online normalization of the numeric features using the estimated standard deviation of all values, which is maintained at the root of the categorization tree. My implementation normalizes all numeric values to have a standard deviation equal to one half, which is the maximum standard deviation that can be achieved in a nominal value. This ensures that numeric and nominal values are treated equally.
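One way to sketch this kind of online normalization is with Welford's incremental variance update, shown below. This is an illustration under my own assumptions rather than the exact code in the library: the class and function names are hypothetical, the population (not sample) standard deviation is used, and values are only rescaled (not centered), since only the scale affects $\sigma$.

```python
import math


class RunningStd:
    """Incrementally tracks the mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0


def normalize(x, stats, target_std=0.5):
    """Rescale x so that the values seen so far would have std `target_std`."""
    current = stats.std()
    if current == 0.0:
        return x  # no spread observed yet, so leave the value unchanged
    return x * target_std / current


stats = RunningStd()
for value in (0.0, 250.0, 500.0, 750.0, 1000.0):
    stats.update(value)
print(normalize(1000.0, stats))
```

In an incremental setting the estimate changes as new examples arrive, so early values are scaled with a less reliable standard deviation than later ones.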

Evaluation of COBWEB/3 Implementation

In order to test whether my implementation of COBWEB/3 is functioning correctly and overcoming the stated limitations, I replicated the COBWEB/3 experiments conducted by Li and Biswas (2002). In their study, they created an artificial dataset with two numeric and two nominal features. Here is the approach Li and Biswas (2002) used to generate their 180 artificial datapoints:

The nominal feature values were predefined and assigned to each class in equal proportion. Nominal feature 1 has a unique symbolic value for each class and nominal feature 2 had two distinct symbolic values assigned to each class. The numeric feature values were generated by sampling normal distributions with different means and standard deviations for each class, where the means for the three classes are three standard-deviations apart from each other. For the first class, the two numeric features are distributed as $N(\mu=4, \sigma=1)$ and $N(\mu=20, \sigma=2)$; for the second class, the distributions were $N(\mu=10, \sigma=1)$ and $N(\mu=32, \sigma=2)$; and for the third class, $N(\mu=16, \sigma=1)$ and $N(\mu=44, \sigma=2)$.

Given these three clusters, Li and Biswas added non-Gaussian noise to either the numeric or nominal features. To add noise, some percentage of either the numeric or nominal features were randomly selected and given the values specified by the other clusters. To compute a clustering, Li and Biswas trained COBWEB/3 using all of the examples and then assigned cluster labels based on which child of the root each example was assigned to. Next, Li and Biswas computed a misclassification count for each assignment. The misclassification count was computed using the following three rules:

If an object falls into a fragmented group, where its type is a majority, it is assigned a value of 1,
If the object is a minority in any group, it is assigned a misclassification value of 2, and
Otherwise the misclassification value for an object is 0.

For this calculation if more than three groups are formed by COBWEB/3, the additional smaller groups are labeled as fragmented groups.

Here is the graph of COBWEB/3’s misclassification count (taken from Li and Biswas’s paper) for increasing amounts of noise in either the numeric or nominal features:

We can see from this graph that COBWEB/3 does not treat noise in the numeric and nominal features equally. Noise in the nominal values seems to have more of an impact on misclassification than noise in the numeric values. To ensure my implementation of COBWEB/3 is functioning normally after adding the new approach for handling small values of $\sigma$, I shut off normalization at the root and attempted to replicate Li and Biswas’s results. Here is a graph showing the performance of my implementation:

At a rough glance, it seems like both implementations are performing the same. Looking a little closer, my implementation has fewer misclassifications overall (max of ~140 vs. ~200). So it looks like the new treatment of $\sigma$ is working and maybe is even improving performance. However, it seems like my implementation is still giving preference to nominal features; i.e., noise in nominal features has a bigger impact on misclassification count. When reading the Li and Biswas paper, my initial thoughts were that the dispersal of the numeric values (e.g., the standard deviations of the clusters are either 1 or 2) might cause COBWEB/3 to give less preference to numeric features because larger standard deviations of values correspond to a lower number of expected correct guesses (see the figure titled Comparison of COBWEB/3 Numeric Correct Guess Formulations). To test if this was the case, I activated online normalization of numeric attributes in my COBWEB/3 implementation and replicated the experiment. Here are the results of this experiment:

This result shows that COBWEB/3 with online normalization treats numeric and nominal attributes more equally. The original Li and Biswas paper compared COBWEB/3 to two other systems (AUTOCLASS and ECOBWEB) and to a new algorithm that they propose (SBAC). Their proposed algorithm was the only one that had better performance than COBWEB/3. When I looked at the implementation details, a key feature of their system is that it is less sensitive to the scale of attributes because it uses median splits. This last graph shows that my COBWEB/3 implementation performs as well as (maybe better than) their proposed SBAC algorithm.

Conclusions

The COBWEB/3 algorithm provides a natural approach for doing machine learning with mixed numeric and nominal data. However, it has two problems: it struggles when the standard deviation of numeric attributes is close to zero and it is sensitive to the scale of the numeric attributes. Erik Harpstead and I created an implementation of COBWEB/3 (available on GitHub) that overcomes both of these issues. To handle the first issue we assume some measurement noise for numeric attributes that results in expected correct guess values that make sense (i.e., no more than one correct guess can be achieved per numeric attribute) and that provide the ability to discriminate between values of $\sigma$ across the entire range $(0,\infty)$. To handle the second issue we added a feature for performing online normalization of numeric features. Next, I tested the modified version of COBWEB/3 by replicating an artificial noise experiment by Li and Biswas (2002). I showed that the modified COBWEB/3 is less sensitive to noise than the original version of COBWEB/3 and treats numeric and nominal attributes more equally. These results show that my implementation of COBWEB/3 is an effective approach for clustering with mixed numeric and nominal data.

Modeling Student Learning in Python

2016-04-22T00:00:00+00:00

An exploration of how to model human learning in a geometry tutor using the additive factors model (AFM) as well as the extended model that accounts for slipping (AFM+S).

Problem Overview

Statistical models of student learning, such as the Additive Factors Model (Cen, 2009), have been getting a lot of attention recently at educational technology conferences, such as Educational Data Mining. These models are used to estimate students’ knowledge of particular cognitive skills (e.g., how to compute the sum of two numbers) given their problem-solving process data. The learned estimates can then be used to predict subsequent student performance and to inform adaptive problem selection. In particular, educational technologies, such as intelligent tutoring systems, can use these estimates to assign each student practice problems that target their specific weaknesses, so that they do not waste time practicing skills they already know. There is some evidence that the time savings with this approach are substantial (Yudelson and Koedinger, 2013); for example, studies have shown that students can double their learning in the same time (Pane et al., 2013) or learn more in approximately half the time (Lovett et al., 2008) when using an adaptive intelligent tutoring system.

Additionally, statistical models of learning can be used to identify the component skills that students are learning in a particular digital learning environment. A key element of these models is that researchers must label the steps of each problem with the skills that they believe are needed to correctly perform them. Researchers can then develop alternative skill models (also called knowledge component models, using the convention from Koedinger et al., 2012) and see which models result in an increased ability to predict the student behavior. The division between skills might not seem important, but it is in fact crucial for adaptive problem selection in that it makes the adaptive selection more precise. When skills are too coarse, students spend time practicing sub-skills they don’t need to. Conversely, when they are too fine, students have to do additional work to prove that they know each skill when, in fact, multiple skills are actually just the same skill.

Given the usefulness of these models and their practical applications, a lot of work has been done to make it easier for researchers to develop new models (both statistical and skill). One resource that I constantly rely on for my research is DataShop, an online public repository for student learning data that (at the time of this writing) contains more than 193 million student transactions across 821 datasets. Further, the DataShop platform also implements the Additive Factors Model, a popular statistical model of student learning, so it is easy to start investigating learning in the available datasets.

While DataShop is capable of running the Additive Factors Model server side, there is no easy way to run the model on your local machine. This is an issue when I want to run different kinds of evaluation on the models that are not available directly in DataShop (e.g., different kinds of cross validation). It is also a problem when you have large datasets because DataShop will not run the Additive Factors Model if the dataset is too big. To overcome this issue some researchers have used R formulas that approximate the model (e.g., see DataShop’s documentation). However, these approximations don’t take into account some of the key features of the Additive Factors Model, such as strictly positive learning rates. Additionally, it isn’t possible to use other variants of the Additive Factors Model, such as my variant that adds slipping parameters (MacLellan et al., 2015); i.e., it models situations where students get steps wrong even when they correctly know the skills, which is a key feature in other statistical learning models such as Bayesian Knowledge Tracing.

To address these issues I implemented both the standard Additive Factors Model (AFM) and my Additive Factors Model + Slip (AFM+Slip) in Python. Further, I wrote the code so that it accepts data files that are in DataShop format (thanks to Erik Harpstead, we should be able to support both transaction-level and student-step-level exports). The code, which I am tentatively calling pyAFM, is available on my GitHub repository. In this blog post, I briefly review the two models that I implement (AFM and AFM+Slip) and provide an example of how they can be applied to one of the public datasets on DataShop.

Background

The Additive Factors Model, and other student models such as Bayesian Knowledge Tracing (Corbett & Anderson, 1994), extend traditional Item-Response Theory models to model student learning (IRT only models item difficulty and student skill). A key component of these modeling approaches is something called a Q-Matrix (Barnes, 2005), which is a mapping of student steps to the skills, or knowledge components, needed to solve them. An initial mapping is typically based on problem types. For example, all problems in the multiplication unit might be labeled with the multiplication skill. Another common initial mapping is to assign each unique step to its own skill (this is basically what most IRT models do). However, good mappings can be difficult to find and often require researchers to iteratively test mappings to see which better fits the student data. Approaches that utilize Q matrices and that model learning typically fit the data substantially better than simple regression models based on problem type alone, such as the technique discussed in this Khan Academy blog post, because they take the effects of different component skills and learning into account.

Additive Factors Model

Many student learning models have been proposed that utilize Q matrices and that model learning, but I’ve chosen to focus on the Additive Factors Model, which is one of the more popular models. The Additive Factors Model is a type of logistic regression model. As such, it assumes that the probability that a student will get step i correct ($p_i$) follows a logistic function:

\[p_i = \frac{1}{1 + e^{-z_i}}\]

In the case of Additive Factors Model,

\[z_i = \alpha_{student(i)} + \sum_{k \in KCs(i)} (\beta_k + \gamma_k \times opp(k, i))\]

where $\alpha_{student(i)}$ corresponds to the prior knowledge of the student who performed step i, $KCs(i)$ specifies the set of knowledge components used on step i (from the Q-matrix), $\beta_k$ specifies the difficulty of the knowledge component k, $\gamma_k$ specifies the rate at which the knowledge component k is learned, and $opp(k, i)$ is the number of practice opportunities the student has had on knowledge component k before step i. Here is an annotated visual representation of the learning curve predicted by the Additive Factors Model:
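To make the equation concrete, here is a minimal sketch that evaluates the predicted probability for a single step. The function name and the parameter values in the usage lines are hypothetical, chosen only to illustrate the shape of the learning curve; this is not the fitting code from pyAFM.

```python
import math


def afm_probability(alpha, betas, gammas, opportunities):
    """Probability of getting a step correct under the Additive Factors Model.

    alpha: intercept (prior knowledge) of the student performing the step.
    betas, gammas, opportunities: parallel lists with one entry per knowledge
    component on the step (difficulty, learning rate, prior practice count).
    """
    z = alpha + sum(
        beta + gamma * opp
        for beta, gamma, opp in zip(betas, gammas, opportunities)
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical single-KC step: the predicted probability rises with practice.
for opp in range(5):
    print(opp, round(afm_probability(0.0, [-1.0], [0.5], [opp]), 3))
```

With these made-up values the student starts at roughly a 27% chance of success and reaches 50% after two practice opportunities, since $z$ is then exactly 0.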

The Additive Factors Model, as specified by Cen and DataShop, also has two additional features. First, the learning rates are restricted to be positive, under the assumption that practice can only improve the likelihood that a student will get a step correct. Second, to prevent overfitting, an L2 regularization is applied to student intercepts. These two features are left out in most implementations of the Additive Factors Model because they cannot easily be implemented in most logistic regression packages.

To implement the Additive Factors Model, I implemented my own Logistic Regression Classifier (on my GitHub here). For convenience, I implemented my classifier as a Scikit-learn classifier (so I can more easily use their cross validation functions). I couldn’t just use Scikit-learn’s logistic regression class because it didn’t provide me with the ability to use box constraints (i.e., to specify that learning rates must always be greater than or equal to 0). Their implementation also does not allow me to specify different regularization settings for individual parameters (i.e., to only regularize the student intercepts). My custom logistic regression classifier implements this functionality.

Additive Factors Model + Slip

This model extends the previous model to include slipping parameters for each knowledge component. These parameters are used to account for situations where students incorrectly apply a skill even though they know it. To add these parameters to the model I had to extend logistic regression to include bounds on one side of the logistic function (an approach I call Bounded Logistic Regression in my paper). Now, the probability that a student will get a step i correct ($p_i$) is modeled as two logistic functions multiplied together:

\[p_i = \frac{1}{1 + e^{-s_i}} \times \frac{1}{1 + e^{-z_i}}\]

Identical to the previous model,

\[z_i = \alpha_{student(i)} + \sum_{k \in KCs(i)} (\beta_k + \gamma_k \times opp(k, i))\]

Additionally, the slipping behavior is modeled as

\[s_i = \tau + \sum_{k \in KCs(i)} \delta_k\]

where $\tau$ corresponds to the average slip rate of all knowledge components and $\delta_k$ corresponds to the difference in slipping rate for the knowledge component k from the average. The logistic function is used to model the slipping behavior because in some situations steps are labeled with multiple knowledge components and the logistic function has been shown to approximate both conjunctive and disjunctive behavior when combining parameters (Cen, 2009). Here is what the predicted learning curve looks like after incorporating the new slipping parameters:

Similar to the previous model, this model also constrains the learning rates to be positive and regularizes the student intercepts. Additionally, it regularizes the individual knowledge component slip parameters (the $\delta$’s) to prevent overfitting.
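The effect of the slip term is easiest to see by evaluating the two-logistic product directly. The sketch below uses hypothetical function names and made-up parameter values; it only evaluates the model equations and does not perform any fitting.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def afm_slip_probability(alpha, betas, gammas, opportunities, tau, deltas):
    """Probability of a correct step under AFM+Slip.

    The learning term z is identical to AFM; the slip term s is the average
    slip parameter tau plus the per-KC offsets (deltas) for the KCs on the step.
    """
    z = alpha + sum(
        beta + gamma * opp
        for beta, gamma, opp in zip(betas, gammas, opportunities)
    )
    s = tau + sum(deltas)
    return logistic(s) * logistic(z)


# Hypothetical single-KC step: the curve plateaus at logistic(tau + delta).
for opp in (0, 2, 10, 50):
    p = afm_slip_probability(0.0, [-1.0], [0.5], [opp], tau=2.0, deltas=[0.0])
    print(opp, round(p, 3))
```

Unlike plain AFM, the predicted probability never exceeds `logistic(s)`, so the predicted error rate levels off above zero no matter how much practice the student has had.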

In order to construct this model I implemented a Bounded Logistic Regression classifier (code is on my GitHub). This classifier is different from the traditional Logistic Regression in that it accepts two separate sets of features (one set for each logistic function).

Comparison on the Geometry Dataset

In my paper I tested this technique on five different datasets across four domains (Geometry, Equation Solving, Writing, and Number Line Estimation). For this blog post, I wanted to step through the process of running both models on the Geometry dataset and to highlight one situation where the two models differ in their behavior.

First, I went to DataShop and exported the Student Step file for the Geometry 96-97 dataset. Next, I ran my process_datashop.py script twice on the exported student step file (once for AFM and once for AFM+S). I selected the knowledge component model that I wanted to analyze (I chose the model LFASearchAICWholeModel3, which is one of the best fitting) and my code returned the following cross validation results (I only included the first three KC and Student parameter estimates for brevity):

$ python3 process_datashop.py -m AFM ds76_student_step_All_Data_74_2014_0615_045213.txt
Unstratified CV    Stratified CV    Student CV    Item CV
-----------------  ---------------  ------------  ---------
            0.397            0.400         0.410      0.402
KC Name                                                             Intercept (logit)    Intercept (prob)    Slope
----------------------------------------------------------------  -------------------  ------------------  -------
Geometry*Subtract*Textbook_New_Decompose-compose-by-addition                    2.563               0.928    0.000
Geometry*circle-area                                                           -0.236               0.441    0.171
Geometry*decomp-trap*trapezoid-area                                            -0.536               0.369    0.091
...
Anon Student Id                         Intercept (logit)    Intercept (prob)
------------------------------------  -------------------  ------------------
Stu_bc7afcb7eef3ccfc1fc6547ed5fcde34               -0.316               0.422
Stu_c43d4a17398b2667daacdc70c76cf8ef               -0.055               0.486
Stu_4902934aaa88223a58cd80f44d0011e1               -0.017               0.496
...

$ python3 process_datashop.py -m AFM+S ds76_student_step_All_Data_74_2014_0615_045213.txt
Unstratified CV    Stratified CV    Student CV    Item CV
-----------------  ---------------  ------------  ---------
            0.396            0.397         0.409      0.399
KC Name                                                             Intercept (logit)    Intercept (prob)    Slope    Slip
----------------------------------------------------------------  -------------------  ------------------  -------  ------
Geometry*Subtract*Textbook_New_Decompose-compose-by-addition                   17.573               1.000    0.745  -0.592
Geometry*circle-area                                                           -1.098               0.250    0.194   0.172
Geometry*decomp-trap*trapezoid-area                                            -1.386               0.200    0.106  -0.288
...
Anon Student Id                         Intercept (logit)    Intercept (prob)
------------------------------------  -------------------  ------------------
Stu_bc7afcb7eef3ccfc1fc6547ed5fcde34               -0.307               0.424
Stu_c43d4a17398b2667daacdc70c76cf8ef               -0.007               0.498
Stu_4902934aaa88223a58cd80f44d0011e1                0.174               0.543
...

These results show that for this dataset the AFM+S model performs better on cross validation than the AFM model (for all types of cross validation). It is also interesting to note that the difficulties of the skills are different after taking the slipping into account. The student prior knowledge estimates are also different. This makes sense because the initial difficulty and prior knowledge estimates are being adjusted to account for the slipping rates. It is also important to note that the learning rate estimates are much higher in the AFM+S model. To get a better sense for why, I plotted the learning curves from the student data with the associated predicted learning curves from the two models. To do this I used my plot_datashop.py script (when prompted I selected the LFASearchAICWholeModel3 KC model):

$ python3 plot_datashop.py ds76_student_step_All_Data_74_2014_0615_045213.txt

This returns an overall learning curve plot for all of the knowledge components together and a plot of the learning curve for each individual knowledge component. Here is the overall learning curve:

It is a little hard to see differences between the two models, but one thing to point out is that the AFM+S model better fits the steeper learning rate at the beginning. This would agree with my finding that the learning rates in the AFM+S model tend to be higher than in the AFM model. This effect is much more pronounced if we choose a knowledge component where the error rate does not converge to zero. For example, let’s look at the Geometry*compose-by-multiplication skill:

We can see from this graph that the traditional additive factors model is trying to converge to zero error, so higher error rates in the tail cause it to have a shallower learning rate at the beginning. In contrast, the model with the slipping parameters better models the initial steepness of the learning curve, as well as the higher error rate in the tail.

Summary

I implemented the Additive Factors Model and Additive Factors Model + Slip in Python and showed how they can be used to model student learning in a publicly available dataset from DataShop. My hope is that by making AFM and AFM+S available and easy to run in Python, more people will consider using them (Hey you over at Khan Academy!). Also, as a researcher, I know how much of a pain it is to have to implement someone else’s student modeling technique, just to compare your technique to it. Now other researchers can easily compare their student learning model to the Additive Factors Model (with all of its nuances, such as positive learning rates and regularized student intercepts) as well as to my Additive Factors Model with Slipping parameters. I’d welcome people showing that their technique is better than mine, just remember to cite me :).

Lazy Shuffled List Generator

2016-04-05T00:00:00+00:00

An attempt to find a generalized approach for shuffling a very large list that does not fit in memory. I explore the creation of generators that sample items without repeats using random number generators.

Problem Overview

For one of my projects, I need to generate a random permutation (without repetition) of the integers from 0 to n. In my particular situation, I am generating permutations for large n and I am only using a small portion of the values at the beginning of the sampled permutation (e.g., generating a permutation of a shorter length k, where k is small but n is large). My initial approach was to build an array of values from 0 to n and then shuffle it using the Fisher-Yates algorithm. However, for large values of n, I am unable to hold all of the elements in memory (in Java, I run out of heap space), so I started looking for more memory efficient approaches that let me generate the elements of a random permutation as needed. In essence, I am looking for an algorithm to do a lazy list shuffle.

I did some googling and found this approach, which consists of constructing a finite field given a prime modulus $m$; e.g., $aX \bmod m$ for $a = 1, 2, 3, \cdots, n$. We know that the field contains the numbers $1 \cdots m-1$ when $m$ is prime (I think it was Gauss that showed this in his book Disquisitiones Arithmeticae). This ensures that we generate all of the values in the desired range at some point. However, sometimes it generates additional values (when $m$ is larger than the size of the range). In these cases, we can simply ignore values outside the range and keep generating values until they are in the desired range. This approach, which comes from the work on format-preserving encryption, is actually pretty efficient if $m$ is not that much larger than the size of the range because only a small portion of values are invalid. Thus, this approach works by selecting the smallest prime $m$ larger than the size of the range we want to generate, selecting a random generator $a$, and selecting an initial seed $X_0$. Then we can generate successive values using the relation $X_n = (X_{n-1} + a) \bmod m$. While this approach can be used to lazily generate the elements of a random permutation with constant memory, some of my early testing showed that the permutations generated are not actually that random.
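To make the additive recurrence concrete, here is a minimal sketch that walks it for a small prime modulus. The helper name `additive_cycle` and the constants are hypothetical, chosen only for illustration. The walk visits every residue exactly once, but consecutive outputs always differ by the same constant, which hints at why the resulting permutations look so regular.

```python
def additive_cycle(m, a, seed):
    """Yield one full cycle of x -> (x + a) % m, starting at seed."""
    x = seed
    while True:
        yield x
        x = (x + a) % m
        if x == seed:
            break

# Illustrative values; with m prime, any a in 1..m-1 cycles through all residues.
m, a, seed = 11, 3, 4
values = list(additive_cycle(m, a, seed))
print(values)

# Collect the distinct gaps between successive outputs (mod m).
diffs = {(y - x) % m for x, y in zip(values, values[1:])}
print(diffs)
```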

To overcome this limitation, I started looking for alternative approaches that generate more “random” looking permutations. It turns out the first approach I found is a special case of a linear congruential generator, a well-known pseudorandom number generator, that only has an additive component (no multiplicative component). Researching this type of number generator more, I found that a Lehmer, or multiplicative congruential, generator is another special case of a linear congruential generator that can be used to generate more random looking permutations. This type of generator lets me have precise control over the number of elements in the finite field, which I can use to ensure we generate full permutations. Additionally, I found another pseudorandom number generator, a linear feedback shift register, that can be used to generate more random looking sequences, while still having precise control over the number of elements in the finite field. These two approaches, which are both based on Finite Field theory, require the selection of a modulus and a generator that determines the size of the finite field and the order in which these elements are generated. To ensure that the finite fields in both approaches have maximal order (i.e., that all of the values in the range appear in the generated permutation) we must ensure that the generators in both approaches are primitive roots of the finite fields. For the multiplicative congruential generator, this consists of selecting the smallest prime $m$ larger than the size of the range and a value $a$ that is a primitive root modulo $m$. For the linear feedback shift register, this consists of selecting a modulo two polynomial that is a primitive root for the number of bits being used in the linear feedback shift register.
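The role of the primitive root can be seen on a tiny example. The sketch below (with a hypothetical helper, `multiplicative_orbit`) iterates $x \mapsto ax \bmod 7$ from a seed of 1 for two choices of $a$. Only the primitive root walks through every nonzero residue before returning to the seed.

```python
def multiplicative_orbit(a, m, seed=1):
    """Return the values visited by x -> (a * x) % m, starting at seed."""
    orbit = []
    x = seed
    while True:
        orbit.append(x)
        x = (a * x) % m
        if x == seed:
            return orbit

m = 7
orbit_primitive = multiplicative_orbit(3, m)  # 3 is a primitive root mod 7
orbit_short = multiplicative_orbit(2, m)      # 2 is not: it has order 3 mod 7
print(orbit_primitive)
print(orbit_short)
```

A generator with a short orbit would silently drop values from the permutation, which is why the multiplicative approach has to test candidates for primitivity first.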

To test all three approaches, I implemented them in Python (I refer to the functions as lazyshuffledrange1, lazyshuffledrange2, and lazyshuffledrange3) and compared them to each other and to the naive Fisher-Yates approach (which I call shuffledrange). Here are my Python implementations of each function.

Approaches

1. Fisher-Yates (shuffledrange)

from random import shuffle
def shuffledrange(start, stop=None, step=1):
    """
    This generates the full range and shuffles it using random.shuffle,
    which implements the Fisher-Yates shuffle: 
        https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
    From the python docs: 
        Note that for even rather small len(x), the total number of
        permutations of x is larger than the period of most random number
        generators; this implies that most permutations of a long sequence can
        never be generated.
    This function has the same args as the builtin ``range'' function, but
    returns the values in shuffled order:
        range(stop)
        range(start, stop[, step])
    >>> sorted([i for i in shuffledrange(10)])
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    """
    if stop is None:
        stop = start
        start = 0
    # Materialize the whole range; this is the memory cost the lazy variants avoid.
    p = list(range(start, stop, step))
    shuffle(p)
    yield from p

2. Additive Congruential Generator (lazyshuffledrange1)

from random import randint
def lazyshuffledrange1(start, stop=None, step=1):
    """
    This approach can be used to lazily generate the full range and shuffles it 
    using a modular finite field. The approach was taken from here:
        http://stackoverflow.com/a/16167976
    This function has the same args as the builtin ``range'' function, but
    returns the values in shuffled order:
        range(stop)
        range(start, stop[, step])
    >>> sorted([i for i in lazyshuffledrange1(10)])
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> sorted([i for i in lazyshuffledrange1(2, 20, 3)])
    [2, 5, 8, 11, 14, 17]
    """
    if stop is None:
        stop = start
        start = 0
    l = len(range(start, stop, step))  # number of values, matching range() exactly
    m = nextPrime(l)
    a = randint(2,m-1)
    seed = randint(0,l-1)
    x = seed
    while True:
        if x < l:
            yield step * x + start
        x = (x + a) % m
        if x == seed:
            break

def nextPrime(n):
    p = n+1
    if p <= 2:
        return 2
    if p%2 == 0:
        p += 1
    while not isPrime(p):
        p += 2
    return p
    
def isPrime(n):
    if n <= 1:
        return False
    elif n <= 3:
        return True
    elif n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i*i <= n: 
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

3. Multiplicative Congruential Generator (lazyshuffledrange2)

from random import randint
import math
def lazyshuffledrange2(start, stop=None, step=1):
    """
    This generates the full range and shuffles it using a Lehmer random number
    generator, which is also called a multiplicative congruential generator and
    is a special case of a linear congruential generator:
        https://en.wikipedia.org/wiki/Lehmer_random_number_generator
        https://en.wikipedia.org/wiki/Linear_congruential_generator
    Basically, we are iterating through the elements in a finite field. There
    are a few complications. First, we select a prime modulus that is slightly
    larger than the size of the range. Then, if we get elements outside the
    range we ignore them and continue iterating. Finally, we need the generator
    to be a primitive root of the selected modulus, so that we generate a full
    cycle. The seed provides most of the randomness of the permutation,
    although we also randomly select a primitive root.
    This function has the same args as the builtin ``range'' function, but
    returns the values in shuffled order:
        range(stop)
        range(start, stop[, step])
    >>> sorted([i for i in lazyshuffledrange2(10)])
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> sorted([i for i in lazyshuffledrange2(2, 20, 3)])
    [2, 5, 8, 11, 14, 17]
    """
    if stop is None:
        stop = start
        start = 0
    l = len(range(start, stop, step))  # number of values, matching range() exactly
    m = nextPrime(l)
    a = randint(2,m-1)
    while not isPrimitiveRoot(a, m):
        a = randint(2,m-1)
    seed = randint(1,l-1)
    x = seed
    while True:
        if x <= l:
            yield step * (x-1) + start
        x = (a * x) % m
        if x == seed:
            break

def isPrimitiveRoot(a, n):
    # assuming n is prime then eulers totient = n-1
    phi = n-1
    for p in factorize(phi):
        if(pow(a, phi//p, n) == 1):
            return False
    return True

def factorize(n):
    """
    A naive approach to finding the distinct prime factors of a number n,
    using trial division by 2 and then by the odd numbers.
    >>> [i for i in factorize(10)]
    [2, 5]
    >>> [i for i in factorize(7*11*13)]
    [7, 11, 13]
    >>> [i for i in factorize(101 * 211)]
    [101, 211]
    >>> [i for i in factorize(11*13)]
    [11, 13]
    """
    i = 2
    step = 1
    while i*i <= n:
        if n % i == 0:
            yield i
            # Strip every copy of this factor so it is reported only once.
            while n % i == 0:
                n //= i
        i += step
        step = 2
    # Whatever remains above 1 is a single prime factor larger than sqrt(n).
    if n > 1:
        yield n

4. Linear Feedback Shift Register (lazyshuffledrange3)

from random import randint
# Primitive polynomial roots up to 48 bits, taken from: 
# http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf
lfsr_roots = [
    [2, 1],
    [3, 2],
    [4, 3],
    [5, 3],
    [6, 5],
    [7, 6],
    [8, 6, 5, 4],
    [9, 5],
    [10, 7],
    [11, 9],
    [12, 6, 4, 1],
    [13, 4, 3, 1],
    [14, 5, 3, 1],
    [15, 14],
    [16, 15, 13, 4],
    [17, 14],
    [18, 11],
    [19, 6, 2, 1],
    [20, 17],
    [21, 19],
    [22, 21],
    [23, 18],
    [24, 23, 22, 17],
    [25, 22],
    [26, 6, 2, 1],
    [27, 5, 2, 1],
    [28, 25],
    [29, 27],
    [30, 6, 4, 1],
    [31, 28],
    [32, 22, 2, 1],
    [33, 20],
    [34, 27, 2, 1],
    [35, 33],
    [36, 25],
    [37, 5, 4, 3, 2, 1],
    [38, 6, 5, 1],
    [39, 35],
    [40, 38, 21, 19],
    [41, 38],
    [42, 41, 20, 19],
    [43, 42, 38, 37],
    [44, 43, 18, 17],
    [45, 44, 42, 41],
    [46, 45, 26, 25],
    [47, 42],
    [48, 47, 21, 20],
]
def lazyshuffledrange3(start, stop=None, step=1):
    """
    This generates the full range and shuffles it using a Fibonacci linear
    feedback shift register:
        https://en.wikipedia.org/wiki/Linear_feedback_shift_register#Fibonacci_LFSRs
    Here I use a table of precomputed primitive roots of different polynomials
    mod 2. In many ways this is similar to the multiplicative congruential
    generator in that we are iterating through elements of a finite field. We
    need primitive roots so that we can be sure we generate all elements in the
    range.  If we get elements outside the range we ignore them and continue
    iterating. Finally, we need the generator to be a primitive root of the
    selected modulus, so that we generate a full cycle. The seed provides the
    randomness for the permutation. 
    This function has the same args as the builtin ``range'' function, but
    returns the values in shuffled order:
        range(stop)
        range(start, stop[, step])
    >>> sorted([i for i in lazyshuffledrange3(10)])
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> sorted([i for i in lazyshuffledrange3(2, 20, 3)])
    [2, 5, 8, 11, 14, 17]
    """
    if stop is None:
        stop = start
        start = 0
    l = len(range(start, stop, step))  # number of values, matching range() exactly
    nbits = l.bit_length()
    # lfsr_roots[0] holds the 2-bit taps, so the n-bit taps sit at index n - 2.
    roots = lfsr_roots[nbits-2]
    seed = randint(1,l)
    lfsr = seed
    while True:
        if lfsr <= l:
            yield step * (lfsr - 1) + start
        # XOR the tapped bits together to get the feedback bit.
        bit = 0
        for r in roots:
            bit = (bit ^ (lfsr >> (nbits - r)))
        bit = (bit & 1)
        lfsr = (lfsr >> 1) | (bit << (nbits - 1))
        if lfsr == seed:
            break

Evaluation

Before conducting any evaluation, a simple analysis of the algorithms shows that the three new functions use constant memory because we don’t retain any of the elements of the permutations and simply use the last value in the permutation sequence to generate the next. This is substantially better than the shuffledrange function that generates the entire range in memory and shuffles it with the Fisher-Yates algorithm. Next, I wanted to get a sense of how random each approach is, so I plotted the output of each function for generating a permutation of the values from $0 \cdots 100$. Here are the four plots:

These plots show that the most random outputs seem to be the shuffledrange (Fisher-Yates is known to generate each possible permutation with equal likelihood) and lazyshuffledrange2 (the multiplicative congruential generator). The lazyshuffledrange3 function (the linear feedback shift register) also seems to be random, but it seems to have some values grouped together. Finally, the lazyshuffledrange1 (the additive congruential generator) is clearly not random (this is the result I was mentioning earlier that led me to try the two additional approaches).

Next, I tested how fast each of these approaches is. To do this, I used Python’s “timeit” function to test how quickly each function could generate a permutation of the values $0 \cdots 1000$. As an additional check of the randomness, I measured the Spearman correlation of the generated values with their position in the list. For both the timing and correlation tests, I ran each approach 3 times and picked the fastest of the three runs, as well as the lowest correlation, to get a sense of the best case for each approach. Here are the results:

Shuffle Function	Time (sec)	Spearman r	pval
shuffledrange	10.779	-0.085	0.007
lazyshuffledrange1	3.833	-0.032	0.305
lazyshuffledrange2	5.527	-0.043	0.170
lazyshuffledrange3	10.834	-0.015	0.636
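For readers who want to reproduce the correlation check, here is a minimal pure-Python sketch. It is not the code that produced the table above: the function name is hypothetical, and it assumes the permutation holds distinct values so the classic no-ties formula for Spearman’s rho applies. It does not compute a p-value.

```python
def spearman_position_correlation(perm):
    """Spearman rho between each value and its position in the list.

    Assumes perm holds distinct values (no ties), so the classic
    1 - 6 * sum(d^2) / (n * (n^2 - 1)) formula applies.
    """
    n = len(perm)
    # Rank of each value among the sorted values (0-based).
    rank = {v: r for r, v in enumerate(sorted(perm))}
    d2 = sum((rank[v] - pos) ** 2 for pos, v in enumerate(perm))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho_sorted = spearman_position_correlation(list(range(10)))
rho_reversed = spearman_position_correlation(list(range(9, -1, -1)))
print(rho_sorted, rho_reversed)
```

A well-shuffled list should give a value near zero, while an unshuffled or reversed list sits at the extremes.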

Next, to get a sense of how these approaches scale, I timed each function for ranges of increasing size. Here is the timing data plotted for each algorithm for ranges of size 10, 100, 1000, and 10000:

These results show that the lazyshuffledrange1 (the additive congruential generator) and lazyshuffledrange2 (the multiplicative congruential generator) functions are the fastest and seem to scale well as the size of the range increases. In contrast, lazyshuffledrange3 (the linear feedback shift register) is slower than the shuffledrange (the Fisher-Yates) function. This is likely because my implementation of lazyshuffledrange3 uses precomputed primitive roots for moduli that are powers of 2. As a result, the number of values that are in the finite field but outside the range tends to be large, and the function takes more iterations to generate valid values.
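One way to probe this explanation is to count how many generator states fall outside the range for each approach. The sketch below uses hypothetical helper names and deliberately simple primality testing. It counts rejected states per full cycle only and says nothing about the cost of each step.

```python
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime strictly greater than n."""
    p = n + 1
    while not is_prime(p):
        p += 1
    return p

def rejected_states(l):
    """Return (rejected by the multiplicative generator, rejected by the LFSR)."""
    prime_states = next_prime(l) - 1       # nonzero residues mod m
    lfsr_states = 2 ** l.bit_length() - 1  # nonzero states of an nbits-wide register
    return prime_states - l, lfsr_states - l

for size in (10, 100, 1000, 10000):
    print(size, rejected_states(size))
```

The LFSR overhead depends on where the range size falls relative to the next power of two, so it varies a lot from one size to the next, whereas the prime modulus usually sits just above the range.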

As a final test, I wanted to see if repeated calls to each algorithm were uniformly sampling from the space of possible permutations. To evaluate this, I chose a small range ($0 \cdots 6$) and called each function 1000 times given this range. Here I plotted the histograms over the resulting permutations:

These results show that the shuffledrange (Fisher-Yates) seems to approach a uniform probability of each permutation (although the Python docs note that for large n, not all permutations may be possible). Our more memory efficient approaches seem to have a pretty uniform probability of each permutation that they generate, but they are all limited in the number of permutations they can generate. The lazyshuffledrange1 (the additive congruential generator) seems to generate the most permutations, and lazyshuffledrange3 (the linear feedback shift register) generates the smallest number of permutations. Lazyshuffledrange2 (the multiplicative congruential generator) is somewhere in the middle. If one is interested in generating the full space of possible permutations, then this is a troublesome result.

On the other hand, the results do make some sense. The number of permutations generated corresponds roughly to the number of possible initial states for the generators. For example, lazyshuffledrange1 accepts a generator $a$ strictly between 1 and the modulus and a seed between 0 and one less than the size of the range. If our range is length 6, then the modulus is 7 (the smallest prime larger than 6) and $a \in [2, 6]$ and $seed \in [0,5]$, so there are 30 possible initial conditions, which agrees with the number of permutations we see in the histogram. By contrast, the lazyshuffledrange2 function also uses the modulus 7, but $a \in \{3, 5\}$ (there are only 2 primitive roots mod 7, see this table) and $seed \in [1,5]$ (a multiplicative congruential generator cannot start off at 0), so there are only 10 possible initial conditions, which agrees with the histogram results. Finally, the linear feedback shift register has fixed primitive roots (because I used a prebuilt table), so its only parameter is $seed \in [1,6]$, or 6 initial conditions, which also agrees with the histogram results.

In order to support more possible permutations we could increase the prime modulus of these generators until the number of possible permutations is large enough. However, this will take more computing power (because more of the values in the finite field are outside the range and more iterations will need to be performed to get values within the range).
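The counting argument for the two congruential generators can be checked by brute force. The sketch below re-implements just the core loops (the function names are hypothetical, and the parameter ranges are the ones the code above draws from for a range of length 6), enumerates every initial condition, and counts the distinct permutations that come out.

```python
def additive_perm(l, m, a, seed):
    """Permutation of 0..l-1 from x -> (x + a) % m, skipping x >= l."""
    out, x = [], seed
    while True:
        if x < l:
            out.append(x)
        x = (x + a) % m
        if x == seed:
            return tuple(out)

def multiplicative_perm(l, m, a, seed):
    """Permutation of 0..l-1 from x -> (a * x) % m, skipping x > l."""
    out, x = [], seed
    while True:
        if x <= l:
            out.append(x - 1)
        x = (a * x) % m
        if x == seed:
            return tuple(out)

l, m = 6, 7
# a in [2, 6], seed in [0, 5] for the additive generator.
additive = {additive_perm(l, m, a, s) for a in range(2, 7) for s in range(0, 6)}
# a in {3, 5} (the primitive roots mod 7), seed in [1, 5] for the multiplicative one.
multiplicative = {multiplicative_perm(l, m, a, s) for a in (3, 5) for s in range(1, 6)}
print(len(additive), len(multiplicative))
```

For comparison, a list of 6 items has 720 possible orderings, so both generators reach only a small slice of the full space.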

As a basic test of this idea, I increased the modulus of the lazyshuffledrange2 function to be the smallest prime larger than 2 times the size of the range, which resulted in the following histogram:

As we can see, this doubling of the number of possible states in the generator yields approximately a doubling in the number of permutations it can generate.

Summary

I implemented four algorithms: one is the naive Fisher-Yates shuffle (shuffledrange) and the other three are more memory efficient approaches to lazily generating the values of a random shuffle (i.e., permutation without repetition) of a list. These approaches are the additive congruential generator (lazyshuffledrange1), multiplicative congruential generator (lazyshuffledrange2), and the linear feedback shift register (lazyshuffledrange3). All three approaches are based on finite field theory. I found that the fastest approach that also (by my very naive eye-ball test and basic correlation analysis) seems to generate suitably pseudorandom shufflings was the multiplicative congruential generator. It also seems to have uniform probability over the possible permutations it can generate. However, the number of possible permutations it can produce is limited by the number of possible states that can be represented in the multiplicative congruential generator, which is determined by the modulus of the finite field. If more possible permutations are needed, then the modulus can be increased. This yields more possible permutations at an additional computation cost. In situations where only a few permutations are needed, it is probably a sufficient, memory efficient approach to lazily generating the elements of a shuffled list.

If any readers have thoughts on how I can increase the possible permutations that can be generated by any of my proposed approaches, I would love to hear about it!

Bitcoins a Beginners Howto

2016-03-27T00:00:00+00:00

I have been hearing about bitcoins a lot lately (here and here), so I decided to check them out and give a basic overview of how to get a bitcoin system running on Ubuntu. It was much more confusing than I thought it would be, but I eventually got it working. Of course, I immediately saw a number of interesting possibilities, which I will discuss at the end.

First, the easiest way to get started is to download the pre-compiled binary files for Linux (available here). The GUI for Linux isn’t working at this time because Ubuntu doesn’t have wxWidgets 2.9 yet, but the command line works great. Once you have the binary files downloaded, extract them:

tar -zxvf bitcoin-0.3.22.tar.gz

Then make a bitcoin directory and bitcoin.conf file:

mkdir ~/.bitcoin
echo "rpcuser=un" > ~/.bitcoin/bitcoin.conf
echo "rpcpassword=pw" >> ~/.bitcoin/bitcoin.conf
echo "gen=0" >> ~/.bitcoin/bitcoin.conf
echo "rpcallowip=127.0.0.1" >> ~/.bitcoin/bitcoin.conf
echo "paytxfee=0.00" >> ~/.bitcoin/bitcoin.conf

With the directory created and the configuration file made go ahead and start the bitcoin server:

~/src/bitcoin-0.3.22/bin/64/bitcoind -daemon
bitcoin server starting

Once the server starts it will take quite a while to download the chain of blocks from the other bitcoin peers (for info on what this means check out this). It took my computer about 2-3 hours.

Once the server has finished loading the blocks you can download a client and have it request work from the server and start computing.

To check the status of your server you can run the following commands (list of commands here):

./bitcoind getinfo

This gives you information about how many blocks you’ve downloaded, etc. I checked Bitcoin Block Explorer to see how many blocks there were (129,000+ when I did this). I used this a lot because bitcoind never seemed to give me any output, so I never knew what it was doing.

As far as the actual bitcoin calculation there are a couple of ways to do it:

I used pyminer because it is simple, the code is human readable, and it works with little to no hassle.

You could also have bitcoind compute directly by editing ~/.bitcoin/bitcoin.conf and replacing:

gen=0

with

gen=1

Then when you run bitcoind -daemon it will also be computing bitcoins.

The added benefit of doing things this way is you could edit the bitcoin.conf file to allow other IP addresses. Then you could have all of your clients connect to the single server to request and process work (essentially creating a local pool) with all of your bitcoins being stored in one place (this may or may not be a bad thing). You could also use this server to feed work out in other ways, since the bitcoind server handles JSON-RPC requests. I was even thinking it would be quite easy to make a JavaScript file that you could include in your website that could connect to your bitcoind server. Then you could harvest CPU power from your web traffic, an interesting idea that others have experimented with (http://www.bitcoinplus.com/generate). I found a general lack of the absolute basics on how to get bitcoind running on Linux/Ubuntu (Mac and Windows have GUIs) and I hope this clears up some of those basics.
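For the JSON-RPC route mentioned above, here is a minimal sketch of what a request to the local server might look like. It only builds the request and does not send it. `build_rpc_request` is a hypothetical helper, the credentials are the placeholder rpcuser/rpcpassword values from the bitcoin.conf shown earlier, and the URL assumes bitcoind is listening on its default RPC port, 8332, on the same machine. The `getinfo` method is the one used on the command line above and may not exist in other versions.

```python
import base64
import json
import urllib.request

def build_rpc_request(method, params=(), user="un", password="pw",
                      url="http://127.0.0.1:8332/"):
    """Build (but do not send) a JSON-RPC request for a local bitcoind."""
    body = json.dumps({"method": method, "params": list(params), "id": 1})
    # HTTP basic auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body.encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )

req = build_rpc_request("getinfo")
print(req.full_url)
print(req.data.decode())
# To actually query a running server:
#     with urllib.request.urlopen(req) as resp:
#         print(json.load(resp))
```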

Installing and Running Opinion Finder for Sentiment Analysis

2016-03-27T00:00:00+00:00

A walkthrough of how to install OpinionFinder 1.5.

Overview

For my social media mining project on twitter sentiment aggregation I need a working version of University of Pittsburgh’s Opinion Finder 1.5.

I went to the website here: http://www.cs.pitt.edu/mpqa/opinionfinder_1.html and requested version 1.5.

First unpackage the download and enter the directory:

tar -zxvf opinionfinder.tar.gz
cd opinionfinder

Installing Sundance

The first part of the install is installing Sundance.

To do this you’ll need csh:

sudo aptitude install csh
cd software/
tar -zxvf sundance-4.37
cd sundance-4.37/include

Open sunstr.C and uncomment the line

/* #include <stdlib.h> */

Open sunstr.h and edit the following include line:

#include <string>

to be

#include <string.h>

Then go to this site (at the bottom of the page) and download the hash.h file here. Replace the current hash.h file (in the /include) with the one you just downloaded.

Lastly, compile the file:

cd ../scripts
./install_script

That was what was necessary for me.

Installing scol1k

Next you need to install scol1k

cd software/
tar -zxvf scol1k.tgz
cd scol1k
cd tools

edit select.c and change line 84:

target = *((int *)lines)++;

to be

target = (*((int *)lines))++;

Then return to the home directory for scol1k and compile:

cd ..
./configure
make
make install

I received some error at the end of running the make command (the make file ran reg on the e8.reg file and got a seg fault).

I ignored this and ran make install anyways… because I think only the stemmer is being used by opinion finder.

Installing Boostexter2.1

Next I moved on to installing BoosTexter 2.1 which I got here. I was able to get the i386 binary for boostexter2.1 and it ran without any problems

Installing Wordnet 1.6

Lastly, I had to get a copy of Wordnet 1.6 here.

It was pretty easy to install

cd wordnet-1.6/

edit the make file for your distribution (uncomment the appropriate line):

#platform=linux

becomes

platform=linux

Then I ran the make script to install the binaries:

sudo make BinWorld

Installing Sundance Apps

cd software
tar -zxvf sundance_apps-4.37
cd sundance_apps-4.37

Then I edited the make file to include the proper path to the sundance directory.

then I installed it:

make autoannotate

Installing Opinion Finder

Edit the config.txt file in the opinion finder directory to point to all of the software packages we just installed.

Lastly, run the install script:

python install.py config.txt

Running Opinion Finder

I followed the directions in the README

I copied the files in the examples into the database/docs folder

cp -R ./examples/marktwain ./database/docs

then I edited the twain.doclist file to have the appropriate path for each file

then I ran opinionfinder command in the opinionfinder directory:

python opinionfinder.py -f ./examples/twain.doclist

That’s it! Best of luck!

Monetizing a Website with Bitcoins

2016-03-27T00:00:00+00:00

This article is a continuation of my last one and focuses on the topic of Bitcoins. Instead of how to mine bitcoins on your computer I am going to discuss how you can use other people’s computers to do all the hard work using Javascript and PHP.

Mining for Bitcoins with Javascript

A recent development in the bitcoin community is Javascript Bitcoin miners. These miners aren’t particularly fast but they allow you to utilize the users viewing your website to do your mining. The miners I was able to find (such as Bitcoinplus) all appeared to charge a fee (I think it was 19% for bitcoinplus). I didn’t want to pay a fee but I couldn’t find any working javascript miners available online. To solve this problem I found the project closest to working and made the changes to the code necessary to make it operational. You can get the working code at github.

Mining for Bitcoins with Wordpress

I then took this one step further. See numbers at the top of this post reading total hashes and hashes per second? I took my code and created a Wordpress plugin that easily integrates Bitcoin mining into your wordpress site. Simply activate the plugin and presto your readers are mining Bitcoins. Those numbers at the top of the page are the stats for your computer as you view this page. Now that I’ve made the plugin I’ll be releasing it on the wordpress plugin directory as soon as I can. You’ll be able to configure the plugin with whatever bitcoin server or pool you are using and be generating in no time.

Converting traffic into money

The current exchange rate for Bitcoins is around $26 US per 1 BTC (bitcoin). I got this exchange rate at the biggest bitcoin exchanger Mt. Gox. Finding a hash results in a reward of 50 BTC or approx. $1300 US. Of course the chances of finding a hash are slim to none but the more traffic you have… the better your chances.

Efficiency

Currently the best javascript miners are about 10-20 times faster than mine, but now that I have everything up and running I can work on efficiency. If anyone would be interested in contributing please let me know.

Conclusion

Mining for Bitcoins is an interesting and unique way to monetize your website which can possibly give you occasional big payouts. While the speed of hashing is lower with javascript, large numbers of viewers may lead to a substantial number of hashes being tested. Keep an eye out for the plugin on wordpress.

Future Work

There may be interesting developments in this area with the release of GPU libraries for javascript enabling one to tap into the power of the video cards of their readers, though I don’t think the technology is ready quite yet. I’ve also had discussions with a few of my friends about creating an email signature which loads the javascript files from your website. This would result in bitcoins being generated when people view your emails. I’m not sure if it would work but it is certainly an interesting idea.

Reading a Thermistor

2016-03-27T00:00:00+00:00

An exploration of how to read a thermistor (a temperature monitor) by measuring the thermistor’s resistance using an analog input on a USB Bit Whacker (UBW).

Overview

I have been struggling over the past week to come up with a solution to read a 10k NTC thermistor for my work. I eventually purchased a USB Bit Whacker (UBW) to perform my A/D conversion as well as satisfy my needs for GPIO. Hardware in hand, I had to figure out how to measure resistance with an analog input.

A friend of mine recommended that I use an LM317L to regulate a constant current and just measure the voltage across the thermistor. While this is a good idea for small resistances, to read a 10k thermistor I may be looking at as much as 15k ohm resistance at low temperatures. Therefore, if the current cannot be regulated to a very small, very precise amount, the voltage will exceed the 5V tolerance of the A/D card (I would need something like <.0003 A). I considered putting a resistor in parallel with the thermistor to limit the resistance and prevent a voltage that would damage my A/D card. But when calculating the temperature from the resistance from the voltage, any extra steps introduce measurement errors that begin to multiply and add up. This led to low accuracy resistance/temp readings.

After exhausting this option I came across a very simple solution to this problem using a voltage divider. Since the UBW has a 5V VCC line, I just used a simple voltage divider with the thermistor to create a voltage between 0-5V that varies with the resistance of the thermistor. Here are the schematic and the Temp vs. Voltage graph:

This solution was significantly better. It was very simple to implement, requiring only one resistor (I don’t even need a separate power source or current regulator). I could measure within 10 ohms of the resistance that the meter measured. NOTE: This accuracy is highly dependent on the accuracy of the measured voltage, which sometimes was accurate and sometimes wasn’t due to the variance in the Vref from the USB. At home I measured it at 4.8V, at work 5.0V. If this isn’t set correctly in the software that converts the 12-bit analog value into a voltage, then the voltage measurement will be off, and so will the resistance/temp measurement.
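The conversion from a raw reading to a resistance can be sketched as follows. This is not the software I used. It assumes the thermistor sits between VCC and the analog pin with the fixed resistor between the pin and ground (the opposite wiring flips the formula), and it takes the 12-bit full scale mentioned above as given. The function names, the 10k fixed resistor, and the raw count are hypothetical.

```python
def adc_to_voltage(count, vref, bits=12):
    """Convert a raw ADC count to volts, given the reference voltage."""
    return vref * count / (2 ** bits - 1)

def thermistor_resistance(v_out, vcc, r_fixed):
    """Solve the divider equation v_out = vcc * r_fixed / (r_therm + r_fixed).

    Assumes the thermistor is on the high side of the divider and that
    v_out is greater than zero.
    """
    return r_fixed * (vcc - v_out) / v_out

R_FIXED = 10_000.0  # ohms, hypothetical divider resistor
count = 2048        # hypothetical raw 12-bit reading

# Same raw count, interpreted with two different assumed reference voltages.
for vref in (5.0, 4.8):
    v = adc_to_voltage(count, vref)
    r = thermistor_resistance(v, vcc=5.0, r_fixed=R_FIXED)
    print(vref, round(v, 3), round(r, 1))
```

Running this with both reference values shows how a 0.2V disagreement about Vref shifts the computed resistance by far more than the 10 ohm agreement with the meter.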

All in all I am very pleased with the UBW and am looking to add USART support and SPI to the firmware provided on the UBW Project Page. I’ll post my results here.

Sentiment Analysis of Tweets using Ruby

2016-03-27T00:00:00+00:00

A description of a very simple twitter sentiment analyzer project. This system loads tweets on a user specified topic, scores the sentiment of each word in each tweet, computes an overall sentiment of each tweet, and then tallies the positive and negative sentiments for the topic.

Overview

Twitter has become an international web phenomenon where people report their everyday ideas and opinions. Along these lines, sentiment analysis of tweets has been seeing a lot of attention lately. There have been articles in Wired Magazine and Bloomberg about using Twitter to predict stock market trends. Work by economists at Technische Universitaet Muenchen (TUM, the Technical University of Munich) has even resulted in a website that gives free stock ticker predictions based on Twitter. All of these articles really interested me and I think there will be more and more demand for the ability to mine social media data for opinions and sentiments. For these reasons I decided to see if I could make a basic Twitter sentiment analyzer with the idea of making it more complex once I master the basics.

For my project I decided to use Ruby to access Twitter’s Search API. It turns out it was extremely easy to use and did not even require any type of registration or authentication. For the sentiment analysis, I used a simple word list that I found online (turns out my project idea was also a class project at UMBC) mapping words to sentiment scores on the range of [-1, 1].

The gist of my basic sentiment analysis algorithm was to gather all the tweets that matched the given search term (Twitter’s Search API pretty much took care of this) and for each tweet take the sum of the sentiment values of the words in the tweet (where a word's value is 0 if it doesn’t appear in my wordlist). If the sum was greater than some positive threshold (something like 0.00 or 0.40) then the tweet would be deemed to have positive sentiment. If the sum was less than the negative threshold (something like 0.00 or -0.40) then the tweet would be deemed to have a negative sentiment. Anything else would be deemed neutral. Then once all the tweets have been classified as positive, negative, or neutral you could use the ratio of positive to negative tweets to determine the general sentiment for the search term.
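The scoring and tallying steps can be sketched in a few lines. This is a Python illustration rather than the Ruby code from the project. The tiny word list, the function names, and the whitespace tokenization are all assumptions made for the example (a real version would at least strip punctuation).

```python
def tweet_score(text, word_scores):
    """Sum the sentiment of each word; words missing from the list count as 0."""
    return sum(word_scores.get(word, 0.0) for word in text.lower().split())


def classify(score, threshold=0.4):
    """Label a summed score using symmetric positive/negative thresholds."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def tally(tweets, word_scores, threshold=0.4):
    """Count how many tweets fall into each sentiment class."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for text in tweets:
        counts[classify(tweet_score(text, word_scores), threshold)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical word list with scores in [-1, 1].
    toy_scores = {"great": 0.8, "awful": -0.9, "ok": 0.1}
    toy_tweets = ["This phone is great", "awful battery", "it is ok"]
    print(tally(toy_tweets, toy_scores))
```

The ratio of the "positive" count to the "negative" count then gives the rough overall sentiment for the search term.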

As a rough estimate of sentiment this algorithm works great, mainly because it is so easy to implement. For serious sentiment analysis you would probably want something more complex. This algorithm would do horribly with sarcasm, multi-subject tweets, tweets not expressing explicit opinions (questions for example), etc. Now that I have a basic system to collect and analyze tweets I think I’ll be performing future work to better analyze the sentiment and opinions found in these tweets.

Here is the Ruby code that I used for collecting tweets and performing this basic analysis in addition to the files with word sentiment:

Twitter Sentiment Analysis Code (github) (Mostly Ruby code)

Teaching a Computer to Play TicTacToe

2016-03-27T00:00:00+00:00

A simple linear regression based approach for teaching a computer to play Tic-Tac-Toe. The system learns to play better by repeatedly playing against itself and learning to recognize good and bad moves. This is a solution to one of the problems in Chapter 1 of Machine Learning.

Approach

I just finished the first chapter of Machine Learning (McGraw-Hill International Edition) by Tom M. Mitchell. This chapter discusses the very basics of how to write a learning system and serves as an introduction to the rest of the book. As part of reading the chapter I decided to do exercise 1.5:

Implement an algorithm similar to that discussed for the checkers problem, but use the simpler game of tic-tac-toe. Represent the learned function Vestimate as a linear combination of board features of your choice. To train your program play it repeatedly against a second copy of the program that uses a fixed evaluation function you create by hand. Plot the percent of games won by your system, versus the number of training games played.

I’ve taken numerous courses on artificial intelligence but I’ve only solved problems such as sudoku solvers and classification learners. While this is a simpler problem, it is a system that when done will be able to provide a challenge to my own wit. I started out very curious as to how well a simple linear function learning system will be able to learn how to play the game.

To complete this problem I created a function that evaluates board states using a linear function, which takes a hypothesis (7 weights) and features (6 of them) that are extracted from the board state. When playing the game, the learner (the computer) gets all of the legal moves and applies the evaluation function to the resulting states to find which move gets the highest rating from the evaluator, which it then acts on.

The six features that I extracted from every board state were (a row is 3 consecutive squares, i.e., the rows, columns, and diagonals):

x1 = # of instances where there are 2 x’s in a row with an open subsequent square.
x2 = # of instances where there are 2 o’s in a row with an open subsequent square.
x3 = # of instances where there is an x in a completely open row.
x4 = # of instances where there is an o in a completely open row.
x5 = # of instances of 3 x’s in a row (value of 1 signifies end game)
x6 = # of instances of 3 o’s in a row (value of 1 signifies end game)
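Here is a minimal sketch of how these six counts could be computed. It is not the code from TicTacToe.py. The board encoding (a list of 9 cells holding 'x', 'o', or ' ') and the name `extract_features` are assumptions, and I read "2 x’s in a row with an open subsequent square" as a line holding two x’s and one empty square.

```python
# The 8 lines of the board as index triples: rows, columns, diagonals.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def extract_features(board):
    """Return [x1, ..., x6] for a board given as 9 cells of 'x', 'o', or ' '."""
    x1 = x2 = x3 = x4 = x5 = x6 = 0
    for line in LINES:
        cells = [board[i] for i in line]
        xs, os_, empty = cells.count("x"), cells.count("o"), cells.count(" ")
        if xs == 2 and empty == 1:
            x1 += 1  # two x's with the third square open
        elif os_ == 2 and empty == 1:
            x2 += 1  # two o's with the third square open
        elif xs == 1 and empty == 2:
            x3 += 1  # a lone x in an otherwise open line
        elif os_ == 1 and empty == 2:
            x4 += 1  # a lone o in an otherwise open line
        elif xs == 3:
            x5 += 1  # three x's: x has won
        elif os_ == 3:
            x6 += 1  # three o's: o has won
    return [x1, x2, x3, x4, x5, x6]


if __name__ == "__main__":
    board = ["x", "x", " ",
             " ", "o", " ",
             " ", " ", " "]
    print(extract_features(board))
```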

I gave the learner some initial weights (a hypothesis) to start with (I set w0,w1,…,w6 all equal to .5). Then I played it against another learner (with the same starting weights). After the game is over I generate training data from the game to use to refine the weights for the next game.

Vtrain(boardstate) = 100 if end of game and you won.
Vtrain(boardstate) = -100 if end of game and you lost.
Vtrain(boardstate) = 0 if end of game and a draw.
Vtrain(boardstate) = Vestimate(successor(boardstate)) if not the end of the game
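As a rough sketch of these rules, the training pairs for one game could be built as follows. It assumes each board state the learner saw has already been reduced to its six features, that w0 is a constant (bias) term, and that "successor" means the next state in the learner's own history. The names `v_estimate`, `training_pairs`, and `OUTCOME_VALUES` are hypothetical.

```python
OUTCOME_VALUES = {"win": 100.0, "loss": -100.0, "draw": 0.0}


def v_estimate(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def training_pairs(feature_history, weights, outcome):
    """Pair each state's features with its training value.

    The final state gets the game outcome; every earlier state gets the
    current estimate of the state that followed it.
    """
    pairs = []
    last = len(feature_history) - 1
    for i, features in enumerate(feature_history):
        if i == last:
            target = OUTCOME_VALUES[outcome]
        else:
            target = v_estimate(weights, feature_history[i + 1])
        pairs.append((features, target))
    return pairs


if __name__ == "__main__":
    history = [[0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]
    for features, target in training_pairs(history, [0.5] * 7, "win"):
        print(features, target)
```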

Using this generated training data I update the weights using the least mean squares (LMS) method.

for each pair <boardstate, Vtrain(boardstate)>:

    use current weights to calculate Vestimate(boardstate).

    for each weight wi, update it as
       wi = wi + learningConstant*(Vtrain(boardstate) - Vestimate(boardstate))*xi

(where learningConstant is a small constant, like .1, that controls the rate at which the weights are updated). Once the weights are updated the system is ready to play another game, even smarter than the last!
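The LMS update can be sketched in Python like this. It is a minimal illustration, not the code from TicTacToe.py. It assumes each training pair already holds the six extracted features rather than a raw board state, and that w0 is paired with a constant input x0 = 1. The names `v_estimate` and `lms_update` are hypothetical.

```python
def v_estimate(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, training_pairs, learning_constant=0.1):
    """Apply one LMS pass over (features, v_train) pairs and return new weights."""
    weights = list(weights)
    for features, v_train in training_pairs:
        error = v_train - v_estimate(weights, features)
        inputs = [1.0] + list(features)  # x0 = 1 pairs with the bias weight w0
        weights = [w + learning_constant * error * x
                   for w, x in zip(weights, inputs)]
    return weights


if __name__ == "__main__":
    start = [0.5] * 7
    pairs = [([1, 0, 1, 2, 0, 0], 0.0)]
    print(lms_update(start, pairs))
```

Each update nudges the estimate for that board state toward its training value. Weights whose feature is 0 in that state are left unchanged.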

The end results of this project were alright. Before I played it, I trained the system for 10,000 games against a player that randomly chooses a move each turn. The computer was capable of steering most games into a draw, and only lost if I was really tricky! That being said, I think my project was a success. On the other hand, I was capable of beating it, and it has been shown that if you play perfect games you never lose (see the TinkerToy computer that has never lost a game here).

With chapter 1 complete I am ready (and excited!) to start on chapter 2.

I’ll keep you updated.

p.s. here is my code if anyone is interested (it is probably really buggy as I whipped it up pretty quick). It is written in python: TicTacToe.py.

The power of the subconscious? I think not.

2016-03-27T00:00:00+00:00

An argument against the theory that our brains continue to work on hard problems when we take breaks. Instead, I argue that we forget failed attempts, but maintain important metadata that facilitates a rapid solution on follow up attempts.

Overview

Insight problems, or problems that are nearly impossible to solve without the crucial insight (with which they become trivial), are an interesting class of problems that lead individuals to interesting conclusions about the subconscious mind. Namely, the experience of insight hints at our subconscious autonomously solving problems for us and magically making our conscious mind aware of the solution after the hard work has been done. I would argue that this conclusion is ill-founded and that it is quite unlikely that our subconscious performs autonomous problem solving. Instead, the act of taking a break or resting leads our mind to go through a forgetting process where we eliminate false assumptions that prevented us from finding a solution. With these limiting false assumptions forgotten, the solution then comes within reach of our conscious mind.

Lately I have been studying insight and creativity in people and trying to figure out how such processes can be modeled with computers. Insight problems are tricky problems that have the distinctive characteristic that they are nearly unsolvable until one possesses the necessary insight to solve them, at which point the problems become trivial. As an example, see the following figure, which shows the nine dots problem. The goal is to connect all the dots by drawing four lines without picking up your pen and never retracing your lines (solution available here). This problem is one of my favorites because it is nearly impossible to solve without first seeing the solution, but once you see the solution you will forever be able to solve the problem. The process one goes through to solve these problems happens to be very similar to that which people go through when solving real world problems.

Nine Dots Problem: connect all the dots by drawing four lines without lifting your pen or retracing a line.

One of my favorite stories of real world insight is when Archimedes discovered the principle of displacement. It was said that he was tasked with determining if the king’s crown was in fact made of pure gold. To perform the necessary calculation he needed to know the volume of the crown (having known the density of gold and the weight of the crown) but since it was irregularly shaped he had no way of calculating it, short of melting it down (which he couldn’t do). After tirelessly attempting to solve the problem Archimedes finally gave up. To help relax after his hard work he decided to take a bath. Upon slipping into the water he observed that the water level rose due to the displacement of water by his body. Eureka! He immediately knew how to calculate the volume of the crown by submersing it in water and measuring the volume of water that had been displaced. This process of futile attempts to solve a problem followed by giving up and then having crucial insight leading to a solution is a pattern across all fields and disciplines.

When originally approaching the problem I shared a similar perspective to that of the French mathematician Henri Poincaré: that when one works hard enough on a problem, the chunks of information become available to the subconscious mind, enabling it to work on your behalf while you sleep and perform other activities. When the subconscious mind solves the problem, it makes your conscious mind aware of the solution, and you experience a flash of insight and immediately know the answer. This hypothesis, often called the autonomous-process hypothesis, is a very popular theory among many scientists who have experienced this (myself included). After thoroughly researching the subject I am convinced that this hypothesis is ill-founded.

Instead what I believe takes place is more of a selective forgetting process. When we originally face problems we make assumptions about the problem that limit the search space to one that can be tractably explored. In insight problems these assumptions lead to the construction of a problem space that lacks a valid solution. When we take a break from the problem solving we begin to forget the original assumptions that we made about the problem. This enables us to make new assumptions the next time we see the problem (most likely new assumptions because we know the old ones didn’t work). These new assumptions may lead to a search space which does in fact contain a solution, making the solution trivial.

This explanation for insight, while less magical, is more likely to be correct as it is supported by the literature that I have read. It would appear that all problem solving is a conscious, mindful task and the subconscious mind isn’t some mystical calculator which is more powerful than our conscious mind. It makes me think that movies like “Limitless”, which purport to make someone super intelligent by tapping into their unused brain power, are in fact preposterous no matter how good of a movie it makes.

This does hint at some interesting future work for artificial intelligence. How does the human mind make search space limiting assumptions? These assumptions make problem solving tractable for humans and could probably do so for computers. Additionally how does one know which assumptions to forget and which to keep? These are problems I am currently exploring.