Imran's personal blog

May 9, 2016

What am I up to in 2016?

Filed under: Uncategorized — ipeerbhai @ 3:01 pm

For the past month, I’ve been working on a machine learning program, accidentally.

A year or so ago, I wrote a little app that uses cloud AI to do language translation.  It worked!  Only for me!  See, I grew up in the American Midwest.  I actually went to the University of Nebraska for a while.  I speak broadcast-perfect English; I could be a news anchorperson.  I also understand AI.  In machine translation, I understand that it's just "transcoding" based on word frequency, Kenneth. This means I can have this kind of conversation with myself:

“How many dogs do you have?”/ “I have two dogs”.

So, because of these factors, I can use a translation AI without problem.  But I often interact with people who are older, have strong accents, and don’t really understand the processing time and optimal speech patterns for cloud machine translation.  They speak differently:

“How many dogs do you have?” / “two”.

Fragmented, fast, impatient, and ambiguous.  A machine system won't handle this conversation well.  The accented, older human just ends up frustrated with the thing.  They didn't get enough cues from the system about what was going on, and it took too long to work.  They want "effortless" translation, or they don't believe/trust it at all.

So, I wanted to solve the problem of conversational translation, along with a slew of other problems like contact search.  Thus, I stepped through the looking glass and decided it was time I learned AI development.  I went looking for frameworks, discovered Encog, a C# neural network/ML framework, and played around with it.  I discovered that the amount of featurization and pre-processing needed for sound neural networks was higher and harder than I liked.  It could be done, but only with a metric tonne of labeled data, data I don't have.

So, I looked at "small data" ideas.  One that interested me was the two-dimensional vector field learner that Numenta has.  I began a pure C# implementation ( I normally don't code in C# because I hate UWP, but this kind of project uses old .NET APIs and no UWP ).  And along the way, it hit me: this two-dimensional learner was a neural network, and machine learning is really just pattern recognition.  The sparse maps are like labels, another way of saying, "Like these, not those".  The two-dimensional field could be represented by a vector of A elements, where A = M × N, the dimensions of the original field.
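The flattening-plus-masking idea can be sketched in a few lines.  This is Python rather than the C# I'm actually using, and the field and mask values are invented, just to show the shape of the thing:

```python
# A small M x N binary "field" flattened into a vector of A = M * N elements.
M, N = 4, 4
field = [
    0, 1, 0, 0,
    0, 1, 0, 0,
    0, 1, 1, 0,
    0, 0, 0, 0,
]  # row-major, length A = M * N

# A "mask" is another vector of the same length; the overlap counts how many
# active bits the field shares with the mask: "Like these, not those".
mask = [
    0, 1, 0, 0,
    0, 1, 0, 0,
    0, 1, 0, 0,
    0, 0, 0, 0,
]

overlap = sum(f & m for f, m in zip(field, mask))
print(overlap)  # 3 active bits in common
```

The point is that once the 2D field is just a vector, "like these, not those" is nothing more than counting shared bits.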

But there's power in the representation that I hadn't expected.  Turns out that viewing a NN as a two-dimensional vector and using masking leads to easier human understanding of what the heck is actually going on in the system.  And this leads to new ideas ( which I'm not ready to share yet, because they're possibly insane ).

Nowadays, I'm developing out the system because it's intellectually engaging.  I've started from ideas, seen how they work in existing frameworks, then moved and maybe improved those ideas into my own framework, because I believe "if you don't build it, you don't understand it".  My framework is woefully incomplete.  It will always create a pattern based on the least significant bits.  It's easy to fool, and doesn't use enough horizontal data when building masks.  But it can do something amazing: it can tell apart two sounds with exactly one sample of each sound, and does so without a label.

And that's not the most exciting part!  As I've been playing with these ideas, a new one has emerged about how to stack and parallelize the detectors and make an atemporal representation of sound streams.  This seems to match what Noam Chomsky says about how human "Universal Grammar" must work. If this idea pans out ( and it's maybe months of implementation time to find out ), then there's a small chance that I'll figure out some part of the language translation problem.

All that excitement is tempered by the fact that I have limited time.  Eventually, I’ll run out of money, and thus time, to do this research.  So the problems I must solve are:

  1. Can I build a framework that’s able to solve the problems I’m interested in?
  2. If not, can the pattern detectors solve problems others are interested in?
  3. Can I sell something from this system to fund my own time?

Anyways, that’s what I’ve been up to recently.

Echo needs a competitor

Filed under: Uncategorized — ipeerbhai @ 2:04 pm

One thing I learned working in big tech — there’s always someone watching.

Take the Amazon Alexa.  You can bet the big 4 tech firms are watching Amazon and trying to decide if they’ll make technology to compete.  And I really wish they would.  I have an echo and love it, but programming for the Echo is crap.

Why is programming for the echo crap?  So many issues:

  1. provisioning services is a nightmare.  You don't even know what services you need to provision, much less have access to a configuration file.  Lots of AWS console pages — lots — to get to hello world.
  2. No audio stream.  If you want to make a phone app, forget it.  Amazon won’t give you the voice data.  There’s AVS that you can use to send voice you capture to Alexa — but there’s no access to the voice in the Echo.
  3. 90 second, fixed format playback from the API.  You literally chunk everything as 90 second long mp3s.
  4. NodeJS.  Voice is not web, and the stateless nature of web design makes no sense in voice apps.  The biggest issue is that your app will respond to any of the registered commands in any sequence.  Conversations, however, are always sequential.  It’s just the wrong language for the job.
    1. NodeJS, outside the web, is sort of a problem.  There’s real harm in imposing the async paradigm on problems that are much simpler to read in a stateful manner.
    2. And not just any NodeJS — you can’t write the code in your own editor.  Amazon wants to make sure they own the coding platform, and you have to write Alexa code in their web editor.
  5. Can’t really sell what you make.  Amazon won’t let you monetize the actual ASK — instead, you have to sell something else, like an unlock code, on Android.
  6. AVS platform lock — AVS is essentially only available for Linux/Android.  If you want to use AVS on PC/Mac, well, you’re SOL.
  7. Overly cloudy.  I'm not a fan of the cloud, because it adds complexity that doesn't need to be there.  But Alexa takes the cake on too much cloud for no reason.  Can't write the code on a local system — must be in the browser.  Can't run any part on your own hardware — must run in AWS.  Every instance requires a lambda spin-up.  Can't sell what you make.  Developers give too much control away when using Alexa.

My team won the Echo prize in the recent Seattle VR Hack-a-thon.  The team at Amazon is amazing, and echo is an amazing product.  Again, I own and love my echo.  But without a competitor, the developer experience is really sub-par.  I also don’t like these cloud companies forcing devs to lock in to them — can’t even use your own editor?  Come on!

So that’s my argument — that Echo needs competition from the big tech companies.  Sure, some start-up can make a great echo-like product with a better developer experience.  There are small-shop products that make similar products that I run across on KickStarter/Indiegogo.  But those companies are vertically focused — no developer experience at all — where the big 4 make APIs…

April 21, 2016

codec2 sparse map

Filed under: Uncategorized — ipeerbhai @ 12:33 am

I’ve been playing with Codec2 and sparse maps.

Sparse maps are an idea I saw from an AI firm ( numenta? )  about how to visualize and filter vector arrays.  The basic idea is that you take a vector ( can be a binary vector, but could be a vector of ints ), assign some color value to some numbers, and spread the vector over a 2d map.  From this map, you can find some number of clusters, and those clusters are essentially concepts.  You build masks of these concepts to see if an output contains the concept.  They use it in natural language search.
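A toy sketch of the idea, in Python/numpy with an invented vector and grid shape (my real code is C#, this is just to show the mechanics):

```python
import numpy as np

# A binary vector of 12 elements spread over a 3 x 4 map, row by row.
v = np.array([1, 1, 0, 0,
              1, 1, 0, 0,
              0, 0, 0, 1])
grid = v.reshape(3, 4)

# A mask for one "concept": here, the 2 x 2 block in the top-left corner.
mask = np.zeros((3, 4), dtype=int)
mask[0:2, 0:2] = 1

# The map "contains the concept" if every bit the mask asks for is set.
contains = bool((grid[mask == 1] == 1).all())
print(contains)
```

Clusters in the map become masks, and checking whether an output contains a concept is just checking the masked positions.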

I took an open source codec called codec2 and built a sparse map of 71 frames of me saying, "ah" and "sh", and put that into a 640×480 picture from the codec's 51-bit frames.

So, a frame is a vertical line in the picture.  Bit 0 is at the top and bit 50 is at the bottom of the line.  Each 9×9 block represents either a 1 or a 0: red 9×9 blocks are 1, green 9×9 pixel blocks are 0.  Frame 0 is the leftmost vertical line, and frame 70 is the rightmost.  There are no spaces between the colored blocks, so it looks continuous, but it is really discrete blocks.  I did this in C#, so there are byte-order flipping issues which I haven't corrected for: essentially, BitArray.CopyTo(byte[]) copies in little-endian order, which I then bit-shift back into order, even though the bit array is in concatenated bit order.  It's something I'll fix later, but the error is consistent, so the generated color map is also consistent.
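For reference, the block layout described above can be reproduced roughly like this in Python/numpy (with random bits standing in for the real codec2 frames):

```python
import numpy as np

FRAMES, BITS, BLOCK = 71, 51, 9  # 71 frames of 51 bits, drawn as 9x9 blocks

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(FRAMES, BITS))  # stand-in for codec2 frame bits

# One vertical line per frame, bit 0 at the top; red block = 1, green block = 0.
img = np.zeros((BITS * BLOCK, FRAMES * BLOCK, 3), dtype=np.uint8)
for f in range(FRAMES):
    for b in range(BITS):
        img[b * BLOCK:(b + 1) * BLOCK, f * BLOCK:(f + 1) * BLOCK] = (
            (255, 0, 0) if bits[f, b] else (0, 255, 0)
        )

print(img.shape)  # (459, 639, 3), which fits on the 640x480 canvas
```

51 bits × 9 pixels gives 459 rows and 71 frames × 9 pixels gives 639 columns, which is why everything fits in a 640×480 picture.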

The results are staggering.  Here’s the picture of “ah”


Here’s the picture of “sh”


These maps look interesting — I think filter masks might be able to detect either:

  1. my voice.
  2. the phones being spoken.

Of course, this could be a dead end.  I haven’t seen if I can generate masks from this yet — but it looks super interesting, so I thought I’d share.


March 8, 2016

Why UWP must die

Filed under: Uncategorized — ipeerbhai @ 1:46 am

There's a brouhaha going on about UWP.  I hate UWP.  Here's why:

  1.  UWP is a dangerous fork of .NET.
    1. MS has not been keeping non-UWP .NET up to date.  For example, the desktop Cortana APIs are UWP-locked.  You can use "Cortana" via Azure in a convoluted way, or you can use straightforward APIs within UWP.  But you can't use Cortana in a straightforward way in .NET.
    2. Even when the API is in both UWP and .NET, the documentation is not updated for .NET. I’ve run into cases where the docs are UWP only, and the .NET version of the API has a different calling convention.
  2. UWP detracts from .NET improvement.  MS is spending too much developer time maintaining two forked APIs that do the same thing.  Nothing is stopping MS from updating .NET and bringing it to all platforms.  Nothing is stopping MS from making store APIs part of .NET.  .NET already supports strong cryptography, including strong-name signed DLLs — everything that UWP is supposed to solve, .NET already provides.
  3. I hate developing UWP.  So much that I've abandoned .NET development.  All new dev work I do is in NodeJS.  This is because UWP keeps creeping into things.  Starting VS?  You get pestered with, "Where's this month's license?" even in VS Community.

I love C# and .NET — really, I do.  I *want* to develop in the MS stack.  UWP has driven me away.  I can't trust that the APIs I want are present.  I can't trust the API docs.  I can't get away from hassles.  Don't get me started on the annoyance of things like NuGet ( how do you debug a NuGet package deployment failure?  That's a nightmare…  npm is so easy — rd /s /q node_modules and npm install… ) and the shrinking number of devs.

To get me to reconsider the MS platform as a serious developer platform, UWP must die.

February 28, 2016

Tech’s diversity problem

Filed under: Uncategorized — ipeerbhai @ 8:32 pm

The New York Times recently ran a story addressing Tech’s diversity problem:

In the story, they wrote about similar problems the Boston Symphony Orchestra had with diversity back in the 60s.  Here’s the bit I find fascinating — The Boston Symphony of that time rarely got female applicants, but when they switched to anonymous auditions and started hiring more women as a result, they started to get more women applicants, too!  It seems people are rational — they won’t apply to something if they believe they won’t get in.

In the tech community, there's a lack of diversity — with many women and some racial minorities not applying to positions.  I've always thought this — why would they apply if they know (1) they're less likely to get in and (2) less likely to advance?

This feedback effect ( lack of diversity causes a lack of applicants, which causes more lack of diversity ) is a loop that must be broken.  Many people I've talked to in tech offer, "But X group never applies to our positions!  We'd hire them!" as a "true excuse" for not hiring diversely.  The statement is true — many open positions don't have diverse applicants — yet the underlying cause is the existing lack of diversity.  Big tech would actually have to practice reverse discrimination to counteract the existing structural problems.  But big tech believes in the myth of meritocracy ( which I do not believe in; as Adam Smith pointed out hundreds of years ago, people as individuals are more similar in ability than different ), and simply cannot see the forest for the trees.

This structural problem explains a lot.  Why are women good at math until the 3rd grade?  For the same reason that pre-school kids normalize achievement when they reach 3rd grade — that’s when there’s enough cognitive ability in a human to see structural bias.  The girls see the structural bias against them in society and redirect their efforts to where their payoff likelihood maximizes relative to others making the same choices.  This is a weird concept — Let’s pretend you’re going to be a “code Janitor”.  This “code janitor” is the idea of the worst job you can have as a developer, whatever it may be, in your company.  It likely is still well paid relative to a receptionist.  So, the purely rational choice would be to strive towards the code janitor position instead of the receptionist.  So, why are women and minorities more likely to strive to being the receptionist?  Because they have a more fair chance at getting the entry position in reception and can advance to the pinnacle of the field unfettered — whereas, as a developer, they’ll face higher hurdles to entry and advancement.  Because humans judge themselves relatively — a high-level receptionist will judge himself against low-level receptionists — it is rational to strive towards reception instead of technology.

The same applies to the pre-school kids ( who are educationally advanced beyond other 3rd graders ), who see the structural bias caused by normalized grading, and adjust their efforts.  These effects show up universally in 3rd grade because humans essentially gain cognitive abilities at very similar rates until they succumb to the incentives in their environment.

Thus, in tech, lots of subsidies will be thrown at ineffective solutions to the diversity problem.  Because the core problem is structural and humans are intelligent, the amount of money thrown at education and diversity efforts is too small compared to the expected lifetime earnings differential a woman or minority expects to see.

February 18, 2016

Feed Forward NN in Matlab

Filed under: Uncategorized — ipeerbhai @ 8:21 am

Matlab is interesting because of its emphasis on vector math.  I've been looking at a simple feedforward vector matrix for neural networks in Matlab.  Here are the basic concepts of how to implement one ( so I can do it again if I ever need to… )

Pretend I’m given a 3 layer network, 1 input layer, 1 hidden layer, and 1 output layer.

The function prototype is predict(t1, t2, X)

where t1 is a matrix (a,b) with a = the number of neurons in the next layer, and b being the number of predictors + 1 for each sigmoid activation function.

and where t2 is a matrix ( c, d) with c = the number of neurons in the output layer, and d being the number of predictors + 1 for each sigmoid activation function.

The number of entities we need to make predictions for is size(X,1);

Here’s a simple for loop to run the weights with a bias neuron in both Input and Hidden layers:

for thisEntity = 1:size(X,1)

    thisInputLayerAsVector = [1; X(thisEntity, :)']; % bias neuron + inputs.

    % next, need to feed this forward to the hidden layer.

    FeedForwardToHidden = [1; sigmoid(t1 * thisInputLayerAsVector)]; % bias + sigmoid of first weights.

    FeedForwardToOutput = sigmoid(t2 * FeedForwardToHidden); % output to the number of final classifiers.

end

After you run this, FeedForwardToOutput will contain a vector "score" for a single entity line, with "1" in the matching positions of this vector and "0" in the non-matching.  Ideally, you should have only one "1" and the rest "0" for multi-class classifications, but that's a function of training, not this math to compute the forward values of the NN.  Now, you'd need to figure out how to convert this score vector to something that makes sense for your use case.
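For comparison, here's the same forward pass in Python/numpy rather than Matlab; the layer sizes and weight values are made-up toys, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(t1, t2, X):
    """Forward pass: t1 maps inputs + bias to hidden, t2 maps hidden + bias to output."""
    scores = []
    for row in X:  # one entity at a time, like the Matlab loop
        a1 = np.concatenate(([1.0], row))               # bias neuron + inputs
        a2 = np.concatenate(([1.0], sigmoid(t1 @ a1)))  # bias + hidden activations
        scores.append(sigmoid(t2 @ a2))                 # output layer
    return np.array(scores)

# Toy network: 2 inputs, 2 hidden neurons, 1 output neuron.
t1 = np.array([[0.5,  1.0, -1.0],
               [0.5, -1.0,  1.0]])  # shape (hidden, inputs + 1)
t2 = np.array([[0.5,  1.0,  1.0]])  # shape (outputs, hidden + 1)
X = np.array([[0.0, 1.0]])

scores = predict(t1, t2, X)
print(scores)  # one score row per entity, each value between 0 and 1
```

Same shapes as the Matlab version: t1 is (hidden, predictors + 1) and t2 is (outputs, hidden + 1), with the +1 being the bias neuron.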

February 14, 2016

Notes on Sigmoid functions.

Filed under: Uncategorized — ipeerbhai @ 2:24 am

This is a quick list of functions for computing regularized logistic regressions:

The logistic equation in matlab format is:

h = 1 ./ (1 + exp(-x)); % exp and ./ are element-wise in Matlab, so this works directly when x is a vector or matrix; no index loop needed.

This creates a cost function that looks like this ( in matlab format, with the theta(1) bias term left out of the regularization ):

J = 1/m * sum( -y .* log(sigmoid(X*theta)) - (1-y) .* log(1 - sigmoid(X*theta)) ) + lambda/(2*m) * sum(theta(2:end).^2);

and gradients that look like this:

sigX = sigmoid(X*theta); % this is a vector of sigmoid values, as theta is a vector.

grad(gradCtr) = 1/m * ((sigX-y)' * X(:,gradCtr)) + lambda/m*theta(gradCtr); % remember sigX is a vector of values where each element is between 0 and 1, and y is a label vector with elements exclusively 0 or 1.

Here, gradCtr is a loop iterator from 2 to the end of the theta array ( theta(1), the bias term, gets the same gradient without the lambda term ).  You can arrive at the above by taking the derivative of the cost function and simplifying it.
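The same gradient, vectorized in Python/numpy instead of the Matlab loop (the data set here is a tiny invented one, just to exercise the math):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized logistic cost; theta[0] is not regularized."""
    m = len(y)
    sigX = sigmoid(X @ theta)           # elements between 0 and 1
    grad = (X.T @ (sigX - y)) / m       # unregularized part, all elements at once
    grad[1:] += (lam / m) * theta[1:]   # lambda term for everything but theta[0]
    return grad

# Tiny invented data set: 3 points, a bias column of ones plus one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)

g = regularized_gradient(theta, X, y, lam=1.0)
print(g)
```

X.T @ (sigX - y) computes every (sigX-y)' * X(:,gradCtr) product in one shot, which is the whole point of vectorizing.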

February 7, 2016

Vectorized Gradient Descent in Matlab

Filed under: Uncategorized — ipeerbhai @ 10:43 pm

Gradient Descent is often implemented in two different ways.  The first is via a nested loop, with an outer loop controlling an update counter and an inner loop computing the error gradients.  The second way is essentially the same as the first, except that optimized vector libraries can replace the inner loop if you can express your error gradients in the form of a vector equation.

First, some background — what’s an error gradient?

In gradient descent, you create something called a cost function.  This cost function is some computation of average error for the entire data set.  You multiply this real number by the column vector of your x values, multiply that by a guess called alpha, then update the value of your guess for the co-factor you're solving for.

Here’s how you do that in Matlab.

First, assume I have this equation and data matching it already made up:

y = t0*x0 + t1*x1, where x0 = 1, and y and x1 are measured and given.  We're solving for t0 and t1.

Here’s the equation with some data for the point (0, 1).

1 = t0*1 + t1*0.

Let’s say the data for all points x,y is loaded into two matching vectors of this form

y = load('y_vector.txt');

x = load('x_vector.txt');

How do we create the needed matrix/vectors needed to run GD in Matlab?

first, let’s make our t vector — we have 2 unknowns we need to solve for — we can set them to any random number.  We’ll use 0.

t = [0; 0]; 

Now, let’s see how many data pairs we have:

m = length(y)

Now, let’s create our X matrix:

X = [ones(m,1), x];

Now, assume we already have a gradient descent function called GD written that can handle iteration for us with this prototype: t = GD ( X, y, t, alpha, iterations ) — let’s solve for the vector t.

t = GD( X, y, t, 0.01, 1000); % does 1,000 iterations with an alpha guess of 0.01

Here’s the update rules for the GD function:

predictions = X * t; % X is a m x 2 matrix, and t is a 2 x 1 vector.  The result is a m x 1 vector that has the predicted y values from the linear equation.

temp1 = t(1) - (alpha/m * (predictions-y)' * X(:,1)); % (predictions - y) is a m x 1 vector of errors between the linear equation and the actual data point's y value.  The ' operator transposes it to a 1 x m vector, which multiplies the m x 1 vector X(:,1) ( the column of all ones ).  A 1 x m vector times a m x 1 vector gives a single real value: numerically, the partial derivative of the average squared error projected onto the axis of X1.
temp2 = t(2) - (alpha/m * (predictions-y)' * X(:,2)); % same as above, except X(:,2) is the m x 1 vector of data values for x in our original point data set.

% do the simultaneous update

t(1) = temp1;
t(2) = temp2;

With this, you have vectorized the computation of GD in Matlab!
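For comparison, the same simultaneous update generalizes to all thetas at once.  Here's a Python/numpy sketch with synthetic data, where GD should recover the known coefficients:

```python
import numpy as np

def gd(X, y, t, alpha, iterations):
    """Vectorized gradient descent: every theta updated simultaneously."""
    m = len(y)
    for _ in range(iterations):
        predictions = X @ t                              # m predicted y values
        t = t - (alpha / m) * (X.T @ (predictions - y))  # simultaneous update
    return t

# Synthetic data from y = 1 + 2*x, so GD should recover t close to [1, 2].
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])  # bias column of ones, then x

t = gd(X, y, np.zeros(2), alpha=0.5, iterations=5000)
print(t)  # approximately [1.0, 2.0]
```

The temp1/temp2 pair in the Matlab version collapses into the single X.T @ (predictions - y) product, which makes the simultaneous update automatic.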

Philosophy of GD vs. normal regression

Filed under: Uncategorized — ipeerbhai @ 10:16 pm

Back in 2010, I graduated from the UW with a degree in Economics and a certificate in econometrics.  Econometrics is a mix of statistics, linear algebra, calculus, and computer science used to solve systems of equations with real world data.  For example, if you wanted to compute house prices, econometrics would allow you to take a set of prices and information about houses, like size, number of bedrooms, permits, average wages in an area, etc — called features — and determine which of those features matter and how much they matter.   As students, we did a lot of linear regression, with two full courses focused on the theory and implementation of linear regressions.  The core computational concept used for solving regressions in this field is a matrix transpose based approach.

More recently, for fun, I've been learning Machine Learning (ML), which uses a different computational concept for solving a regression system of equations.  In ML, the most common method is based on a fixed number of iterations with some magic constants and differentials on a cost function, called "Gradient Descent" (GD).  ML also has a matrix approach based on the same method used in econometrics, which ML folks call the "Normal Equation".

There are a lot of differences between the methods, but both can solve the core optimization problem of reducing error between a prediction equation and actual data values.  First, the central regression equation is this:

y = A + B1*x1 + B2*x2 … + e

where y is a measured data value, A is unknown, B1… Bn is unknown, and x1…xn are known feature values.

So, here’s an example of the equation from a single data point for a house price with a single feature of house size in square feet.

$100,000 = A + B1 * (1,000 sf) + e.

You’re solving for A, B1, and sometimes e ( which can sometimes also have functions wrapping it ).  An example solution might be:

$100,000 = $99,000 + $1/sf * 1,000sf + 0.

You’re solving via the “system of equations” method that you learned in high-school algebra, except that you’re solving in matrix form.  Then, you’re using a computational matrix engine like eViews or R to solve the equation in a single step, and find all the values, even e ( which is the sum of residuals ).

In econometrics, we have error diagnostic functions to determine if our solution was any good.  We also have ways to use the error to solve some types of non-linearity in time series regressions, like auto-regression and moving averages.  There’s a lot of philosophy behind this equation and when to use it and debug it.  Both econometrics and ML are optimizing this equation when we talk about linear regressions.

In ML, some things change — the constant A ( alpha ) and the factors B1 … Bn ( betas ) are all called theta in ML.  The ML equation is changed to this:

y = t0*x0 + t1*x1 + … with no e term, and with x0 = 1.

Gradient Descent’s core method is this ( also called the partial cost function ).

Theta(j) := Theta(j) – alpha/m * Sum[(prediction-actual)*Xj]

where alpha is a magic number that the person solving the equation guesses will solve the equation ( there are some rules about this number, but it's just a guess ), m is the number of data measurements, and Xj is the column of X containing the data values matching that theta.

You’re solving the equation by calling the error term a “cost value”, and creating a cost function to figure out the cost value, then minimizing this cost value using a system of updating guesses based on the derivative of the cost function.

This difference in how you generate your matrix and how you treat error is the core philosophical difference between the two disciplines.  Both computationally optimize an equation to fit data for unknown constants — but econometrics is concerned with "goodness of fit", uses error diagnostically, and has non-linear corrections available, where ML guesses the solution based on alpha and collapses the error spatially to a single value per feature.  Both systems can work with non-linear functions — the form is linear in the parameters, and linear algebra is used for both systems — hence the name, "linear regression".
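The two philosophies land on the same numbers, which is easy to check.  Here's a Python/numpy sketch (synthetic house-price-style data, invented for illustration) solving the same system both ways:

```python
import numpy as np

# Invented house data: size in thousands of square feet, price in $1,000s,
# generated exactly from price = 50 + 100 * size.
x = np.array([0.5, 1.0, 1.5, 2.0])
y = 50.0 + 100.0 * x
X = np.column_stack([np.ones_like(x), x])

# Econometrics-style "normal equation": solve (X'X) t = X'y in one step.
t_normal = np.linalg.solve(X.T @ X, X.T @ y)

# ML-style gradient descent on the same system, with a guessed alpha.
t_gd = np.zeros(2)
m = len(y)
alpha = 0.3
for _ in range(20000):
    t_gd = t_gd - (alpha / m) * (X.T @ (X @ t_gd - y))

print(t_normal)  # [50.0, 100.0]
print(t_gd)      # the same values, up to numerical precision
```

One method is a single matrix solve; the other iterates with a guessed alpha, but both minimize the same squared error.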

January 28, 2016

Notes on Encog

Filed under: Uncategorized — ipeerbhai @ 11:46 pm

I've been learning Heaton Research's Encog machine learning framework.  This post is simple notes on how to use the framework.  As a fan of Windows, Visual Studio, and C#, I wanted a framework that was easy to learn and use with that stack.

Step 1: Prepare data.

Use any class that implements IMLData.  Here’s a quick snippet with BasicML data type.

Encog.ML.Data.IMLData myData = new Encog.ML.Data.Basic.BasicMLData(new double[] { 0.0, 1.0 });

We can also use data sets. Here's a snippet with the XOR input matrix and its labeled output vector.

double[][] matrixXorInputs = {
    new[] { 0.0, 0.0 },
    new[] { 0.0, 1.0 },
    new[] { 1.0, 0.0 },
    new[] { 1.0, 1.0 }
};

double[][] vectorLabeledOutputs = {
    new[] { 0.0 },
    new[] { 1.0 },
    new[] { 1.0 },
    new[] { 0.0 }
};

Encog.ML.Data.Basic.BasicMLDataSet myDataset = new Encog.ML.Data.Basic.BasicMLDataSet(matrixXorInputs, vectorLabeledOutputs);


Step 2. Prepare the network.

This gets a little more complicated.  Neural networks have layers — input layer, hidden layer(s) and output layer.  You define each layer then initialize the network to random weights.  Remember neural network theory — neural networks are searches and combinations over a topology.  Each layer allows a new set of combinations — you can sort of map the concept to the size of an exponent, with more layers increasing the maximum dimensionality of your search space.  Here’s the C# code for a simple feed forward network in low-dimensional space.

Encog.Neural.Networks.BasicNetwork myNetwork = new Encog.Neural.Networks.BasicNetwork(); // Create the network, then configure it.

myNetwork.AddLayer(new Encog.Neural.Networks.Layers.BasicLayer(null, true, 2)); // an input layer with 2 inputs and a single bias neuron

myNetwork.AddLayer(new Encog.Neural.Networks.Layers.BasicLayer(new Encog.Engine.Network.Activation.ActivationSigmoid(), true, 2)); // hidden layer — Sigmoid activation function per neuron, 1 bias neuron in layer, 2 neurons.

myNetwork.AddLayer(new Encog.Neural.Networks.Layers.BasicLayer(new Encog.Engine.Network.Activation.ActivationSigmoid(), false, 1)); // output layer — Sigmoid activation function per neuron, 0 bias neuron in layer, 1 neuron.

// Now, tell encog that I’m done declaring the net structure, and initialize the net.

myNetwork.Reset(); //random weights to start.

Step 3.  Train the network.

You can use any class that implements IMLTrain.  Here's a simple resilient propagation trainer:

Encog.ML.Train.IMLTrain myTrainer = new Encog.Neural.Networks.Training.Propagation.Resilient.ResilientPropagation(myNetwork, myDataset);
do
{
    myTrainer.Iteration(); // run one training iteration of resilient propagation.
} while (myTrainer.Error > 0.01);

Step 4. Use the trained network to evaluate unknown inputs.

double[] output = new double[] { 0.0 };
myNetwork.Compute(new double[] { 0.0, 0.0 }, output); // The input to the evaluator is 0, 0.  Use whatever input you want.
Console.WriteLine("output is {0}", output[0]);

Usage notes:

Training can get into oscillations where it just can't train.  The training time is highly variable — sometimes you get trained in a few iterations, and sometimes it takes a while.  In a quick sample run of the above code, I could train in 60 iterations or never train.  This is due to the learning rate — neural networks seem to suffer the same issues as any gradient-descent-based regression algorithm.


