Category Archives: Aaron

Lessons in ML Optimization

One of the “fun” parts of working in ML for someone with a background in software development rather than academic research is that lots of hard problems remain unsolved. There are rarely defined ways things “must” be done, or in some cases even rules of thumb, for tasks like implementing a production-capable machine learning system for a specific real-world problem.

For most areas of software engineering, by the time a technology is mature enough for enterprise deployment, it has long since gone through the fire and the flame of academic support, Fortune 50 R&D, and broad ground-level acceptance in the development community. It didn’t take long for distributed computing with Hadoop to be standardized, for example. Web security, index systems for search, relational abstraction tiers, even the most volatile of production tier technology, the JavaScript GUI framework, all go through periods of acceptance and conformity before most large organizations try to roll them out. It all makes sense if you consider the cost of migrating your company from a legacy Struts/EJB3.0 app running on Oracle to the latest HTML5 framework with a Hadoop backend. You don’t want to spend months (or years) investing in a major rewrite only to find that it’s entirely out of date by your release. Organizations looking at these kinds of updates want an expectation of longevity for their dollar, so they invest in mature technologies with clear design rules.

There are certainly companies that do not fall in this category: small companies who are more agile and can adopt a technology in the short term to retain relevance (or buzzword compliance), companies funded with external research dollars, or companies that invest money to stay on the bleeding edge. However, I think it’s fair to say that the majority of industry and federal customers are looking for stability and cost efficiency from solved technical problems.

Machine Learning is in the odd position of being so tremendously useful in comparison to prior techniques that companies who would normally wait for the dust to settle, and for development and deployment of these capabilities to become fully commoditized, are dipping their toes in. I wrote in a previous post how a lot of the problems with implementing existing ML algorithms boil down to lifecycle, versioning, deployment, security, etc., but there is another major factor, which is model optimization.

Any engineer on the planet can download a copy of Keras/TensorFlow and a CSV of their organization’s data and smoosh them together until a number comes out. The problem comes when the number takes an eternity to output and is wrong. In addition to understanding the math that allows things like SGD to work for backpropagation, or why certain activation functions are more effective in certain situations, one of the jobs for data scientists tuning DNN models is to figure out how to optimize the various buttons and knobs in the model to make it as accurate and performant as possible. Because a lot of this work *isn’t* a commodity yet, it’s a painful learning process of tweaking the data sets, adjusting model design or parameters, and rerunning and comparing the results to try to find optimal answers without overfitting. Ironically, the task data scientists are doing is one perfectly suited to machine learning. It’s no surprise to me that Google developed AutoML to optimize their own NN development.

A number of months ago Phil and I worked on an unsupervised learning task related to organizing high dimensional agents in a medical space. These entities were complex “polychronic” patients with a wide variety of diagnoses and illnesses. Combining fields for patient demographic data with their full medical claim history, we came up with a method to group medically similar patients and look for statistical outliers as indicators of fraud, waste, and abuse. The results were extremely successful and resulted in a lot of recovered money for the customer, but the interesting thing technically was how the solution evolved. Our first prototype used a wide variety of clustering algorithms, value decompositions, non-negative matrix factorization, etc., looking for optimal results. All of the selections and subsequent hyperparameters had to be modified by hand, the results evaluated, and further adjustments made.

When it became clear that the results were very sensitive to tiny adjustments, it was obvious that our manual tinkering would miss important gradient changes, so we implemented an optimizer framework. It could evaluate manifold learning techniques for stability and reconstruction error, and then cluster the results of the reduction, searching the parameter space with either a complete fitness landscape walk, a genetic algorithm, or a sub-surface division.
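
As a rough illustration of the “complete fitness landscape walk” option, here is a minimal sketch that exhaustively scores every combination in a small hyperparameter grid. The parameter names (`n_components`, `n_neighbors`) and the toy scoring function are assumptions made up for this example; they are not the interface of the framework described above.

```python
from itertools import product

def grid_walk(grid, score):
    """Score every combination in `grid` and return (best_params, best_score).

    `grid` maps a parameter name to a list of candidate values;
    `score` is any callable taking a dict of params (lower is better).
    """
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical stand-in for a reconstruction error: it is minimized at
# n_components=3, n_neighbors=10.
def toy_score(params):
    return (params["n_components"] - 3) ** 2 + abs(params["n_neighbors"] - 10)

grid = {"n_components": [2, 3, 4, 5], "n_neighbors": [5, 10, 15]}
best, err = grid_walk(grid, toy_score)
```

The exhaustive walk is only practical for small grids, which is presumably why alternatives such as a genetic algorithm are attractive once the number of knobs grows.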

While working on tuning my latest test LSTM for time series prediction, I realized we’re dealing with the same issue here. There is no hard and fast rule for questions like, “How many LSTM Layers should my RNN have?” or “How many LSTM Units should each layer have?”, “What loss function and optimizer work best for this type of data?”, “How much dropout should I apply?”, “Should I use peepholes?”

I kept finding articles during my work saying things like, “There are diminishing returns for more than 4 stacked LSTM layers”. That’s an interesting rule of thumb… what is it based on? The author’s intuition based on the data sets for the particular problems they were experiencing presumably. Some rules of thumb attempted to generate a mathematical relationship between the input data size and complexity and the optimal layout of layers and units. This StackOverflow question has some great responses: https://stackoverflow.com/questions/35520587/how-to-determine-the-number-of-layers-and-nodes-of-a-neural-network

A method recommended by Geoff Hinton is to add layers until you start to overfit your training set. Then you add dropout or another regularization method.
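
That heuristic can be sketched as a simple loop. This is only an illustration under assumptions: `train_and_eval` is a hypothetical callback that trains a model of the given depth and returns training and validation loss, and “overfitting” is approximated here as validation loss no longer improving.

```python
def grow_until_overfit(train_and_eval, max_layers=8, tolerance=0.0):
    """Add layers one at a time until validation loss stops improving.

    `train_and_eval(n_layers)` is a hypothetical callback that trains a
    model with `n_layers` layers and returns (train_loss, val_loss).
    Returns the last depth that still improved validation loss.
    """
    best_layers, best_val = 0, float("inf")
    for n_layers in range(1, max_layers + 1):
        _train_loss, val_loss = train_and_eval(n_layers)
        if val_loss < best_val - tolerance:
            best_layers, best_val = n_layers, val_loss
        else:
            # Validation loss got worse (or stalled) while capacity grew:
            # treat this as the onset of overfitting and stop growing.
            break
    return best_layers, best_val

# Made-up loss curves: training loss keeps falling, validation loss
# bottoms out at 3 layers.
fake_results = {1: (0.9, 1.0), 2: (0.6, 0.8), 3: (0.4, 0.7),
                4: (0.2, 0.75), 5: (0.1, 0.9)}
depth, val = grow_until_overfit(lambda n: fake_results[n], max_layers=5)
```

From the depth where the loop stops, the next step in the heuristic would be to add dropout or another regularizer and retrain.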

Because so much of what Phil and I do tends towards the generic repeatable solution for real world problems, I suspect we’ll start with some “common wisdom heuristics” and rapidly move towards writing a similar optimizer for supervised problems.

Intro to LSTMs with Keras/TensorFlow

As I mentioned in my previous post, one of our big focuses recently has been on time series data for either predictive analysis or classification. The intent is to use this in concert with a lot of other tooling in our framework to solve some real-world applications.

One example is a pretty classic time series prediction problem with a customer managing large volumes of finances in a portfolio where the equivalent of purchase orders are made (in extremely high values) and planned cost often drifts from the actual outcomes. The deltas between these two are an area of concern for the customer as they are looking for ways to better manage their spending. We have a proof of concept dashboard tool which rolls up their hierarchical portfolio and does some basic threshold based calculations for things like these deltas.
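
A threshold-based delta check of the kind described can be very small. The sketch below is illustrative only: the order tuples, the 10% default threshold, and the function name are assumptions, not the dashboard’s actual logic.

```python
def flag_deltas(orders, threshold=0.10):
    """Return (order_id, relative_delta) for orders whose actual cost
    drifts from planned cost by more than `threshold` (a fraction)."""
    flagged = []
    for order_id, planned, actual in orders:
        if planned == 0:
            continue  # relative drift is undefined without a planned cost
        delta = (actual - planned) / planned
        if abs(delta) > threshold:
            flagged.append((order_id, round(delta, 4)))
    return flagged

# Made-up (order_id, planned_cost, actual_cost) records.
orders = [("PO-1", 1000.0, 1050.0),
          ("PO-2", 2000.0, 2600.0),
          ("PO-3", 500.0, 400.0)]
flagged = flag_deltas(orders)
```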

A much more complex example we are working on in relationship to our trajectories in belief space is the ability to identify patterns of human cultural and social behaviors (HCSB) in computer mediated communication to look for trustworthy information based on agent interaction. One small piece of this work is the ability to teach a machine to identify these agent patterns over time. We’ve done various unsupervised learning which in combination with techniques such as dynamic time warping (DTW) have been successful at discriminating agents in simulation, but has some major limitations.
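
For readers unfamiliar with DTW, here is a minimal sketch of the classic dynamic-programming formulation for two one-dimensional sequences. It uses absolute difference as the local cost, which is an assumption for illustration; it is not the implementation used in our simulation work.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two 1-D sequences, using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Same shape, but the second series lingers one extra step at the start.
same_shape = dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])
# A flat series never matches the rise and fall.
different = dtw_distance([0, 1, 2, 1, 0], [2, 2, 2, 2, 2])
```

The point of the warping is visible in the first call: the two series differ in length and timing, yet align at zero cost, where a point-by-point comparison would not even be defined.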

For many time series problems a very effective method of applying deep learning is using Recurrent Neural Networks (RNN), which allow the history of the series to help inform the output. This is particularly important in cases involving language, such as machine translation or autocompletion, where the context of the sentence may be formed by elements spoken earlier in the text. Convolutional networks (CNNs) are most effective when the tensor elements have a distinct positional meaning in relationship to each other. The most common example is a matrix of pixel values where the value of a pixel has a direct relevance to nearby pixels. This allows for some nice parallelization and other optimizations, because you can assume that a small window of pixels will be relevant to each other and not necessarily dependent on “meaning” from pixels somewhere else in the picture. This is obviously a very simplified explanation, and there are lots of ways CNNs are being expanded to have broader applications, including for language.

In any case, despite recent cases being made for CNNs being relevant for all ML problems (https://arxiv.org/abs/1712.09662), the truth is that RNNs are particularly good at sequentially understood problems which rely on the context of the entire series of data. This is of course useful for time series data as well as language problems.

The most common and popular example of RNN implementation for this is the Long Short-Term Memory (LSTM) RNN. I won’t dive into all of the details of how LSTMs work under the covers, but I think it’s best understood this way: while in a traditional artificial neural network each neuron has a single activation function that passes a single value onward, LSTMs have units (or cells in some literature) which are more complex, consisting most commonly of a memory cell, an input gate, an output gate, and a forget gate. A given LSTM layer has a configured number of fully connected LSTM units, each of which contains the above pieces. This allows each unit to have some “memory” of previous pieces of information, which helps the model factor in things such as language context or patterns in the data occurring over time. Here is a link for a more complete explanation: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
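
To make the gate description concrete, here is a minimal sketch of one time step of a single LSTM unit with scalar inputs, following the standard formulation without peepholes. The weight layout and the weight values are arbitrary assumptions for illustration, not trained values and not how Keras stores them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single scalar LSTM unit (no peepholes).

    `w` maps each gate name to (input_weight, recurrent_weight, bias).
    Returns the new hidden state and the new memory cell value.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new candidate to write
    o = sigmoid(pre("output"))        # how much memory to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state / output
    return h, c

# Arbitrary illustrative weights shared by all gates.
gate_names = ("forget", "input", "output", "candidate")
weights = {name: (0.5, 0.1, 0.0) for name in gate_names}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, weights)
```

The memory cell `c` is carried from step to step alongside the hidden state `h`, which is what gives the unit its “memory” of earlier inputs.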

Training LSTMs isn’t much different than training any NN. It uses backpropagation against a training and validation set, with the configured hyperparameters and the layout of the layers having a large effect on performance and accuracy. For most of my work I’ve been using Keras & TensorFlow to implement time series predictions. I have some saved code for doing time series classification, but it’s a slightly different method. I found a wide variety of helpful examples early on, but they included some not obvious pitfalls.

Dr. Jason Brownlee at MachineLearningMastery.com has a bunch of helpful introductions to various ML concepts including LSTMs with example data sets and code. I appreciated his discussion about the things which the tutorial example doesn’t explicitly cover such as non-stationary data without preprocessing, model tuning, and model updates. You can check this out here: https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/

Note: The configurations used in this example suffice to explain how LSTMs work, but the accuracy and performance aren’t good. A single layer with a small number of LSTM cells trained for a large number of epochs produces pretty wide swings in predicted values. You can see this by repeating the experiment a number of times and comparing the RMSE scores, which can be wildly off run-to-run.
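
One way to quantify that run-to-run variation is to compute the RMSE for each repeated run and summarize the spread. This is a minimal sketch; the prediction values are made up, and in practice each entry in `runs` would come from retraining the model from scratch.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_spread(actual, runs):
    """Summarize RMSE across repeated training runs.

    `runs` is a list of prediction sequences, one per run.
    """
    scores = [rmse(actual, predicted) for predicted in runs]
    return {"min": min(scores), "max": max(scores),
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores)}

# Made-up predictions from three hypothetical runs of the same model.
actual = [10.0, 12.0, 14.0]
runs = [[10.0, 12.0, 14.0], [13.0, 15.0, 17.0], [10.0, 12.0, 8.0]]
summary = rmse_spread(actual, runs)
```

A large standard deviation relative to the mean is a sign that a single run’s score should not be trusted on its own.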

Dr. Brownlee does have additional articles which go into some of the ways in which this can be improved such as his article on stacked LSTMs: https://machinelearningmastery.com/stacked-long-short-term-memory-networks/

Jakob Aungiers (http://www.jakob-aungiers.com/) has the best introduction to LSTMs that I have seen so far. His full article on LSTM time series prediction can be found here: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction while the source code (and a link to a video presentation) can be found here: https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Series-Prediction

His examples are far more robust including stacked LSTM layers, far more LSTM units per layer, and well characterized sample data as well as more “realistic” stock data. He uses windowing, and non-stationary data as well. He has also replied to a number of comments with detailed explanations. This guy knows his stuff.
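
For anyone new to windowing, here is a minimal sketch of the general idea: slice the series into overlapping fixed-size windows, each paired with the value that follows it, and optionally rescale each window relative to its first value. This is a generic illustration with made-up prices and assumed function names, not code taken from his repository.

```python
def make_windows(series, window_size):
    """Split a series into overlapping (inputs, target) pairs: each window
    of `window_size` consecutive values predicts the value that follows."""
    pairs = []
    for start in range(len(series) - window_size):
        window = series[start:start + window_size]
        target = series[start + window_size]
        pairs.append((window, target))
    return pairs

def normalize_window(window):
    """Express each value as a relative change from the window's first
    value (assumed non-zero), one common way to cope with non-stationary
    price data."""
    base = window[0]
    return [value / base - 1.0 for value in window]

prices = [100.0, 110.0, 121.0, 115.0, 120.0]
pairs = make_windows(prices, window_size=3)
first_window_scaled = normalize_window(pairs[0][0])
```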

Latest DNN work

It’s been a while since I’ve posted my status, and I’ve been far too busy to include all of the work with various AI/ML conferences and implementations, but since I’ve been doing a lot of work specifically on LSTM implementations I wanted to include some notes for both my future self, and my partner when he starts spinning up some of the same code.

Having identified a few primary use cases for our work (high dimensional trajectories through belief space, word embedding search and classification, and time series analysis), we’ve been focusing a little more intently on some specific implementations for each capability. While Phil has been leading the charge with the trajectories in belief space, and we both did a bunch of work in the previous sprint preparing for integration of our word embedding project into the production platform, I have started focusing more heavily on time series analysis.

There are a variety of reasons that this particular niche is useful to focus on, but we have a number of real world / real data examples where we need to either perform time series classification, or time series prediction. These cases range from financial data (such as projected planned/actual deltas), to telemetry anomaly detection for satellites or aircraft, among others. In the past some of our work with ML classifiers has been simple feed forward systems (classic multi layer perceptrons), naive Bayesian, or logistic regression.

I’ve been coming up to speed on deep learning, becoming familiar with both the background and the mathematical underpinnings. Btw, for those looking for an excellent start to ML, I highly recommend the Patrick Winston (MIT) videos: https://youtu.be/uXt8qF2Zzfo

Over the course of several months I did pretty constant research all the way through the latest arXiv papers. I was particularly interested in Hinton’s papers on capsule networks, as they have some direct applicability to some of our work. Here is an article summing up capsule networks: https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b

I did some research into the progress of current deep learning frameworks as well, looking specifically at examples which were suited to production deployment at scale over frameworks most optimal for single researchers solving pet problems. Our focus is much more on the “applied ML” side of things rather than purely academic. The last time we did a comprehensive deep learning framework “bake off” we came to a strong conclusion that Google TensorFlow was the best choice for our environment, and my recent research validated that assumption was still correct. In addition to providing TensorFlow Serving to serve your own models in production stacks, most cloud hosting environments (Google, AWS, etc) have options for directly running TF models either serverless (AWS lambda functions) or through a deployment/hosting solution (AWS SageMaker).

The reality is that lots of what makes ML difficult boils down to things like training lifecycle, versioning, deployment, security, and model optimization. Some aspects of this are increasingly becoming commodity available through hosting providers which frees up data scientists to work on their data sets and improving their models. Speaking of models, on our last pass at implementing some TensorFlow models we used raw TensorFlow I think right after 1.0 had released. The documentation was pretty shabby, and even simple things weren’t super straightforward. When I went to install and set up a new box this time with TensorFlow 1.4, I went ahead and used Keras as well. Keras is an abstraction API over top of computational graph software (either TensorFlow default, or Theano). Installation is easy, with a couple of minor notes.

Note #1: You MUST install the specific versions listed. I cannot stress this enough. In particular, cuDNN and the CUDA Toolkit are updated frequently, and if you blindly click through their download links you will get a newer version which is not compatible with the current versions of TensorFlow and Keras. The software is all moving very rapidly, so it’s important to use the compatible versions.

Note #2: Some examples may require the MKL dependency for Numpy. This is not installed by default. See: https://stackoverflow.com/questions/41217793/how-to-install-numpymkl-for-python-2-7-on-windows-64-bit which will send you here for the necessary WHL file: https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

Note #3: You will need to run the TensorFlow install as sudo/administrator or get permission errors.

Once these are installed there is a full directory of Keras examples here: https://github.com/keras-team/keras/tree/master/examples

This includes basic examples of most of the basic DNN types supported by Keras as well as some datasets for use such as MNIST for CNNs. When it comes to just figuring out “does everything I just installed run?” these will work just fine.

Aaron 4.25.17

  • Wasted a ton of time today tracking down progress of integration of additional teams into our program.
  • Spent a couple of hours tackling a poster presentation to be delivered at a technical leadership summit next week. I’ll be presenting the “Advanced Analytics” presentation and discussing all of our tools, and capabilities. Phil helped a lot, and I ended up quite pleased with the results. One of the nice things is we were able to include screenshots of actual tools and graphs of the data we’re using. I think this will be a nice difference from the rest of the presenters.
  • Did some good pair programming with Phil on the Pandas DataFrame.sort issue, moved to the non-deprecated version of DataFrame.sort_values and got it working correctly at all matrix sizes.

Aaron 4.3.17

  • ML Architecture
    • Spent a bunch of time last Friday meeting with Phil to discuss the proposed path for the Machine Learning epics to develop the research browser.
    • Our plan uses a thin-client Angular 2 app for the bulk of the annotation/tagging process, with an optional companion browser plugin developed later to do in-document tagging, which will capture the URL, and snippet text.
    • We’re intending to use a simple Naive Bayesian classifier for document categories, and to use more complex classifiers (DNNs) for snippet content and user behaviors in the future.
    • Given this we’re feeling pretty confident about the proposed timeframe. It’s unclear how we’ll implement the Bayesian classifier; since it has already been developed in Weka/Java, it may not be in our best interests to rewrite it in Python.
  • Python integration
    • Using ProcessBuilder works for the simple case where we want to do essentially batch clustering, but it is very difficult to debug in CI/Prod instances as it becomes a “black box”. There are methods to make it more communicative, but we should investigate looking at a Python based WSO2 secured microservice. It would make it far easier to integrate Python code into our stack.
    • I looked at multiple methods to do HDFS integration using Python, and found some canonical recent examples with Python 3.x.
  • Hadoop is dead, long live ML?
  • ClusteringService
    • Reviewed the MapReduce code for the service. It’s pretty straightforward, using the mapper to build the row data and the reducer to format it for output.
    • The actual table it needs to pull from is currently missing… so tests do not pass if set to the real table, but once my new laptop is loaded I will be able to make changes.

Aaron 3.21.17

Missed my blog yesterday as I got overwhelmed with a bunch of tasks. I’ll include some elements here:

  • KeyGeneratorLibrary
    • I got totally derailed for multiple hours as one of the core libraries we use throughout the system to generate 128-bit non-crypto hashes for things like rowIds had gotten thoroughly dorked up. Someone had accidentally dumped 70 mb of binary unstructured content into the library and checked it in.
    • While I was clearing out all the binary content, I was asked to remove all of the unused dependencies from our library template. All of our other libraries include SpringBoot and a bunch of other random crap, but I took the time to rip it all out and build a new version, and update our Hadoop jobs to use the latest one. The combined changes dropped the JAR from ~75 mb to 3k. XD
  • Hadoop Development
    • More flailing wildly trying to get our Hadoop testing and development process fixed. We’re on a new environment, and essentially it broke everything, so we have no way to develop, update, or test any of our Hadoop code.
    • Apparently this has been fixed (again).
  • TensorFlow / Sci-Py Clustering
    • Sat in with Phil for a bit looking at his latest fancy code and the output of the clusters. Very impressive, and the code is nice and clean. I’m really looking forward to moving over to predominantly Python code. I’m super burned out on Java right now, and would far rather be working on pure machine learning content rather than infrastructure and pre-processing. Maybe next sprint?
  • TFRecord Output
    • Got a chance to write a playground for TFRecord output and Python integration, before realizing that the TF ecosystem code only supports InputFormat/OutputFormat for Hadoop, and due to our current issues I cannot run those tests locally at all. *sad trombone*
  • Python Integration
    • My day is rapidly winding to a close, but slapping out the test code for the Python process launching so I can at least feel like I accomplished something today.
  • Cycling / Health
    • Didn’t get to cycle today because I spent 2 hours trying to get a blood test so my doctor can verify my triglycerides have gone down.

Aaron 3.17.17

  • Hadoop Environment
    • More fun discussions on our changes to Hadoop development today. Essentially we have a DevOps box with a baby Hadoop cluster we can use for development.
  • ClusteringService scaffold / deploy
    • I spent a bit of time today building out the scaffold MicroService that will manage clustering requests, dispatch the MapReduce to populate the comparison tensor, and interact with the TensorFlow Python.
    • I ran into a few fits and starts with syntax issues where the service name was causing problems because of an errant “-”. I resolved those and updated the Dockerfile with the new TensorFlow Docker image. Once I have a finished list of the packages I need installed for Python integration, I’ll have to have them added to that image.
    • Bob said he would look at moving over the scaffold of our MapReduce job launching code from a previous service, and I suggested he not blow away all the work I had just done and copy the as needed pieces in.
  • TFRecord output
    • Trying to complete the code for outputting MapReduce results as TFRecord protobuf objects for TensorFlow.
    • I created a PythonIntegrationPlayground project with an OutputTFRecord.java class responsible for building a populated test matrix in a format that TensorFlow can view.
    • Google supports this with their ecosystem libraries here. The library includes instructions with versions and a working sample for MapReduce as well as Spark.
    • The frustrating thing is that presumably to avoid issues with version mismatches, they require you to compile your own .proto files with the protoc compiler, then build your own JAR for the ecosystem.hadoop library. Enough changes have happened with protoc and how it handles the locations of multiple inter-connected proto files that you absolutely HAVE to use the locations they specify for your TensorFlow installation or it will not work. In the old days you could copy the .proto files local to where you wanted to output them to avoid path issues, but that is now a Bad Thing(tm).
    • The correct commands to use are:
      • protoc --proto_path=%TF_SRC_ROOT% --java_out=src\main\java\ %TF_SRC_ROOT%\tensorflow\core\example\example.proto
      • protoc --proto_path=%TF_SRC_ROOT% --java_out=src\main\java\ %TF_SRC_ROOT%\tensorflow\core\example\feature.proto
    • After this you will need Apache Maven to build the ecosystem JAR and install so it can be used. I pulled down the latest (v3.3.9) from maven.apache.org.
    • Because I’m a sad, sad man developing on a Windows box I had to disable the Maven tests to build the JAR, but it’s finally built and in my repo.
  • Java/Python interaction
    • I looked at a bunch of options for Java/Python interaction that would be performant enough, and allow two-way communication between Java/Python if necessary. This would allow the service to provide the location in HDFS to the TensorFlow/Sci-Kit Python clustering code and receive success/fail messages at the very least.
    • Digging on StackOverflow led me to a few options.
    • Digging a little further I found JPServe, a small library based on PyServe that uses JSON to send complex messages back to Java.
    • I think for our immediate needs it’s most straightforward to use the ProcessBuilder approach:
      • ProcessBuilder pb = new ProcessBuilder("python", "test1.py", "" + number1, "" + number2);
      • Process p = pb.start();
    • This does allow return codes, although not complex return data, but it avoids having to manage a PyServe instance inside a Java MicroService.
  • Cycling
    • I’ve been looking forward to a good ride for several days now, as the weather has been awful (snow/ice). Got up to high 30s today, and no visible ice on the roads so Phil and I went out for our ride together.
    • It was the first time I’ve been out with Phil on a bike with gears, and it’s clear how much I’ve been able to abuse him being on a fixie. If he’s hard to keep up with on a fixed gear, it’s painful on gears. That being said, I think I surprised him a bit when I kept a 9+ mph pace up the first hill next to him and didn’t die.
    • My average MPH dropped a bit because I burned out early, but I managed to rally and still clock a ~15 mph average with some hard pedaling towards the end.
    • I’m really enjoying cycling. It’s not a hobby I would have expected would click with me, but it’s a really fun combination of self improvement, tenacity, min-maxing geekery, and meditation.
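
For the ProcessBuilder approach noted under Java/Python interaction above, the Python side only needs to read its arguments and signal success or failure through its exit code. Here is a minimal sketch of what a script like test1.py could look like; the body (summing the two numbers) and the specific exit codes are assumptions for illustration, not the actual script.

```python
import sys

def main(argv):
    """Hypothetical body of test1.py: receive two numbers as command-line
    arguments and report success or failure through the exit code."""
    if len(argv) != 3:
        print("usage: test1.py NUMBER1 NUMBER2", file=sys.stderr)
        return 2  # wrong argument count
    try:
        number1, number2 = float(argv[1]), float(argv[2])
    except ValueError:
        print("arguments must be numeric", file=sys.stderr)
        return 1  # bad input
    # Anything written to stdout can be read on the Java side through
    # Process.getInputStream(); the exit code comes back from
    # Process.waitFor().
    print(number1 + number2)
    return 0

# In the real script the last line would be:
#     sys.exit(main(sys.argv))
```

Keeping the logic in `main` and returning an integer makes the script easy to exercise from a Python test without spawning a process, which helps with the “black box” feel of launching it from Java.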

Aaron 3.13.17

  • Sprint Review
    • Covered issues with having customers present with Sprint Reviews; i.e., don’t do it, it makes them take 3x as long and cover less.
    • Alternative facts presented about design tasks.
  • ClusteringService
    • Sent design content to other MapReduce developer.
    • Sent entity model queries out regarding claim data.
  • Cycling
    • I went out for the 12.5 mile loop today. It was 30 degrees with a 10-12 mph wind, but it was… easy? I didn’t even lose my breath going up “Death Hill”. I guess it’s about time to move onto the 15 mile loop for lunchtime rides.
  • Sprint Grooming / Sprint Planning
    • It was decided to roll directly from grooming to planning activities.

Aaron 3.6.17

  • TensorFlow
    • Didn’t get to do much on this today; Phil is churning away learning matrix operations and distance calculations to let us write a DBSCAN plug-in
  • Architecture
    • Drawing up architecture document with diagram
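
For reference, the DBSCAN algorithm mentioned above is compact enough to sketch in plain Python. This version is an illustration only: it uses a brute-force Euclidean neighbor search rather than TensorFlow matrix operations, so it is not the plug-in itself.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise.

    Uses a brute-force O(n^2) Euclidean neighbor search, so it is only
    suitable for small illustrative data sets.
    """
    def neighbors(idx):
        return [j for j, q in enumerate(points)
                if math.dist(points[idx], q) <= eps]

    labels = [None] * len(points)    # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise near a core point becomes a border point
            if labels[j] is not None:
                continue             # already assigned; do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Two tight made-up groups plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The neighbor search is the part that maps naturally onto a pairwise distance matrix, which is where the matrix operations and distance calculations come in.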

Aaron 3.3.17

  • Architecture Status
    • Sent out the reasonably in-depth write-up of the proposed design for complex automatic clustering yesterday and expected to get at least a few questions or comments back; I ended up having to spend far more of my day than I wanted responding.
    • The good news is that the overall design is approved and one of our other lead MapReduce developers is up to speed on what we need to do. I’ll begin sending him some links and we’ll follow up on starting to generate code in between the sprints.
  • TensorFlow
    • I haven’t gotten even a fraction of the time spent researching this that I wanted, so I’m well behind the learning curve as Phil blazes trails. My hope is that his lessons learned can help me come up to speed more quickly.
    • I’m going to continue some tutorials/videos today to get caught up so next week I can chase down the Protobuf format I need to generate with the comparison tensor.
    • I did get a chance to watch some more tutorials today covering cross-entropy loss method that made a lot of sense.
  • Cycling
    • I went for a brief ride today (only 5 miles) and managed to fall off my bike for the first time. I went to stop at an intersection and unclipped my left foot fine, when I went to unclip my right foot, the cold-weather boot caught on the pedal and sent me crashing onto the curb. Fortunately I was bundled up enough that I didn’t get badly hurt, just bent my thumbnail back. Got back on the bike and completed the rest of the ride. I was still too sore to do the 12.5 mile today, especially in 20 mph winds.
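
As a reminder of what the cross-entropy loss from the TensorFlow notes above computes, here is a minimal single-example sketch in plain Python. The probability values are made up, and the small `eps` clamp is an assumption added to avoid taking the log of zero.

```python
import math

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross-entropy H(p, q) = -sum(p_i * log(q_i)) for one example.

    `true_dist` is typically a one-hot label; `predicted_dist` is the
    model's probability for each class (for example a softmax output).
    """
    return -sum(p * math.log(max(q, eps))
                for p, q in zip(true_dist, predicted_dist))

label = [0.0, 1.0, 0.0]
confident_right = cross_entropy(label, [0.05, 0.90, 0.05])
confident_wrong = cross_entropy(label, [0.90, 0.05, 0.05])
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the correct class, so a confident wrong answer is penalized far more heavily than a confident right one.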

Aaron 3.2.17

  • TensorFlow
    • Started the morning with 2 hours of responses to client concerns about our framework “bake-off” that were more about their lack of understanding machine learning and the libraries we were reviewing than real concerns. Essentially the client liaison was concerned we had elected to solve all ML problems with deep neural nets.
    • [None, 784] is a 2D tensor of any number of rows with 784 dimensions (corresponding to total pixels)
    • W and b are the weights and bias. These are added as Variables, which allow the output of the training to be re-entered as inputs, and can be initialized as tensors full of 0s to start.
    • W has a shape of [784, 10] because we want evidence for each of the different classes we’re trying to solve for; in this case that is 10 possible digits. b has a shape of [10] so we can add it to the output, which softmax then turns into a probability distribution over those 10 classes summing to 1.
  • ETL/MapReduce
    • Made the decision to extract the Hadoop content from HBase via a MicroService and Java, build the matrix in Protobuf format, and perform TensorFlow operations on it then. This avoids any performance concerns about hitting our event table with Python, and lets me leverage the ClusteringService I already wrote the framework for. We also have an existing design pattern for MapReduce dispatched to Yarn from a MicroService, so I can avoid blazing some new trails.
  • Architecture Design
    • I submitted an email version of my writeup for tensor creation and clustering evaluation architecture. Assuming I don’t get a lot of pushback I will be able to start doing some of the actual heavy lifting and get some of my nervousness about our completion date resolved. I’d love to have the tensor built early so that I could focus on the TensorFlow clustering implementation.
  • Proposal
    • More proposal work today… took the previously generated content and rejiggered it to match the actual format they wanted. Go figure they didn’t respond to my requests for guidance until the day before it was due… at 3 PM.
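
The tensor shapes from the TensorFlow notes above can be checked with a plain-Python version of the same computation, softmax(x · W + b), for a single flattened image. This is a sketch for illustrating the shapes only; it uses nested lists instead of TensorFlow tensors, and the input values are placeholders.

```python
import math

def softmax(evidence):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(evidence)  # subtract the max for numerical stability
    exps = [math.exp(e - shift) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, W, b):
    """Compute softmax(x . W + b) for one flattened image.

    x has 784 entries, W is 784 rows by 10 columns, b has 10 entries.
    """
    evidence = [sum(x[i] * W[i][k] for i in range(len(x))) + b[k]
                for k in range(len(b))]
    return softmax(evidence)

n_pixels, n_classes = 784, 10
W = [[0.0] * n_classes for _ in range(n_pixels)]  # weights start at zero
b = [0.0] * n_classes                             # biases start at zero
x = [0.5] * n_pixels                              # placeholder "image"
probs = predict(x, W, b)
```

With everything initialized to zero, every class gets the same evidence, so the untrained model spreads its probability evenly across the 10 classes.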

Aaron 3.1.17

  • TensorFlow
    • Figuring out TensorFlow documentation and tutorials (with a focus on matrix operations, loading from Hadoop, and clustering).
    • Really basic examples with tiny data sets like linear regression with gradient descent optimizers are EASY. Sessions, variables, placeholders, and other core artifacts all make sense. Across the room Phil’s hair is getting increasingly frizzy as he’s dealing with more complicated examples that are far less straightforward.
  • Test extraction of Hadoop records
    • Create TF tensors using Python against HBase tables to see if the result is performant enough (otherwise recommend we write a MapReduce job to build out a proto file consumed by TF).
  • Test polar coordinates against client data
    • See if we can use k-means/DBSCAN against polar coordinates to generate the correct clusters with known data. If we cannot use polar coordinates for dimension reduction, what process is required to implement DBSCAN in TensorFlow?
  • Architecture Diagram
    • The artifacts for this sprint’s completion are architecture diagrams and proposal for next sprint’s implementation. I haven’t gotten feedback from the customer about our proposed framework, but it will come up in our end-of-sprint activities. Design path and flow diagram are due on Wednesday.
  • Cycling
    • I did my first 15.2 mile ride today. My everything hurts, and my average speed was way down from yesterday, but I finished.
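
The “really basic” linear regression case from the TensorFlow notes above looks roughly like this when written out by hand. This sketch uses plain Python rather than TensorFlow sessions and placeholders, and the data, learning rate, and epoch count are made-up assumptions.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Noise-free points on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

A gradient descent optimizer in a framework automates exactly the two gradient lines and the two update lines in the loop.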

Aaron 2.28.17

9:00 – BRC

  • TensorFlow
    • Installed following TF installation guide.
    • Found issues with the install instructions almost immediately. Found this link with a suggestion that I followed to get it installed.
    • Almost immediately found that the Hello World example succeeded with a list of errors. Apparently it’s a known issue for the release candidate which was just fixed in the nightly build as per this link.
    • I haven’t had a chance to try it yet, but found a good Reddit link for a brief TF tutorial.
    • I went through the process of trying to get my IntelliJ project to connect and be happy with the Python interpreter in my Anaconda install, and although I was able to RUN the TF tutorials, it was still acting really wacky for features like code completion. Given Phil was able to get up and running with no problems doing a direct pip install to local Python, I scrapped my intent to run through Anaconda and did the local install. Tada! Everything is working fine now.
  • Unsupervised Learning (Clustering)
    • Our plan is to implement our unsupervised learning for the IH customer in an automated fashion by writing a MR app dispatched by MicroService that populates a Protobuf matrix for TensorFlow.
    • The trick about this is that there is no built in density-based clustering algorithm native for TF like the DBSCAN we used on last sprint’s deliverable. TF supports K-Means “out of the box” but with the high number of dimensions in our data set this isn’t ideal. Here is a great article explaining why.
    • However, one possible method of successfully utilizing K-Means (or improving the scalability of DBSCAN) is to convert our high dimensional data to polar coordinates. We’ll be investigating this once we’re comfortable with TensorFlow’s matrix math operations.
  • Proposal Work
    • Spent a fun hour of my day converting a bunch of content from previous white-papers and RFI documents into a one-page write-up of our Cognitive Computing capabilities. Ironically the more we have to write these the easier it gets because I’ve already written it all before. Also more importantly as time goes by more and more of the content describes things we’ve actually done instead of things we have in mind to do.
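
For the polar coordinate idea in the clustering notes above, the usual generalization beyond two dimensions is hyperspherical coordinates: one radius plus n-1 angles. Here is a minimal sketch of that conversion in plain Python; it assumes this standard generalization is what is meant, which the notes do not spell out.

```python
import math

def to_hyperspherical(x):
    """Convert an n-dimensional Cartesian vector (n >= 2) to
    (radius, [angle_1, ..., angle_{n-1}]).

    The first n-2 angles lie in [0, pi]; the last lies in (-pi, pi].
    """
    radius = math.sqrt(sum(v * v for v in x))
    angles = []
    for k in range(len(x) - 2):
        # Angle between axis k and the remaining components.
        tail = math.sqrt(sum(v * v for v in x[k + 1:]))
        angles.append(math.atan2(tail, x[k]))
    angles.append(math.atan2(x[-1], x[-2]))
    return radius, angles

r2, a2 = to_hyperspherical([1.0, 1.0])       # plain 2-D polar coordinates
r3, a3 = to_hyperspherical([0.0, 0.0, 2.0])  # a point on the third axis
```

Note that the conversion by itself keeps the same number of values (n in, n out), so any dimension reduction would have to come from how the radius and angles are used afterwards.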