Rapid Development and Deployment of Machine Learned Models
Enterprises are struggling with what to do with the tremendous amount of data they generate. Outside Silicon Valley, data scientists are scarce, and not every enterprise can afford the high cost of hiring them to perform the analysis. Even starting a project is expensive, and the value may not be realized quickly.
For ML technology to be used in all kinds of enterprises (and not just tech-savvy ones), the new generation of platforms and tools needs to be easy to use and reasonably priced. Platforms must make it easy for business professionals, not just data scientists, to use ML techniques to improve business outcomes. Those outcomes span the customer engagement journey, from repeat business to enhanced customer retention and customer satisfaction.
We have seen many recent announcements of Deep Learning technology used by Google, IBM, Microsoft and Facebook, and by private companies such as H2O among others, and some of those technologies are now being open-sourced. There are more than a dozen open-source technologies that one can leverage to get started. However, there are a few challenges we need to address to make platforms easy to use. Higher-level abstractions need to be defined in the platform, not only for data scientists but also for business users. It would be great if my product manager could use the platform to find out which models are running, and could even build a model using the configuration I used and set it up for an A/B test. Why not? Yes, big tech companies have built tools to support that, but it is still a struggle for smaller departments, mid-sized companies and startups. Personally, I have experienced that challenge in large technology companies as well as in startups.
Predictive Analytics and Machine Learning are such oft-used, overloaded phrases that there is a tendency to overpromise the benefits. It takes time to show value to a business, and the “start small, move fast” philosophy comes in handy. So the ability to start a project at a low cost is critical. Another point to note is that there is no substitute for online testing. Don’t use back-tested results to “overpromise” the impact on the business.
The best way to approach testing is rapid iteration. Let’s look at key features required in a platform to achieve rapid development. Let me start with a cautionary note – I am not targeting tech-heavy enterprises which have big teams of data scientists, but rather enterprises which want to leverage this technology to solve their business problems and can afford a small data team to prove the worth before investing more. Cloud-based solutions from small and big companies (e.g., Amazon ML) are now available to test out new ideas at a smaller cost.
Let me take a problem that I ran into recently in the ad tech area and use it to discuss the features that are important for an ML platform. Even if I target “US” audiences, we invariably find that some percentage of traffic is tagged as outside the targeting criteria by reporting systems (such as DFA). Those fraudulent impressions cost real money. We need a quick way to detect them and to stop bidding on those suspicious impression calls in the Real-Time Bidding (RTB) system. This means we need a platform that can continuously update the model every hour.
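As a rough illustration of the check described above, the sketch below computes, for each traffic source, the share of impressions whose reported country falls outside the targeted one, and flags sources above a threshold. The record fields (source, country), the threshold and the minimum-volume cutoff are all assumptions made for illustration; they are not part of any real RTB or DFA interface.

```python
from collections import defaultdict


def flag_suspicious_sources(impressions, target_country="US",
                            max_mismatch_rate=0.2, min_impressions=3):
    """Return {source: mismatch_rate} for sources exceeding the threshold.

    `impressions` is an iterable of dicts with the hypothetical keys
    'source' and 'country' (the country as reported after the fact).
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for imp in impressions:
        totals[imp["source"]] += 1
        if imp["country"] != target_country:
            mismatches[imp["source"]] += 1

    flagged = {}
    for source, total in totals.items():
        if total < min_impressions:
            continue  # too little data to judge this source
        rate = mismatches[source] / total
        if rate > max_mismatch_rate:
            flagged[source] = rate
    return flagged


# Toy data, for illustration only.
sample = [
    {"source": "site_a", "country": "US"},
    {"source": "site_a", "country": "US"},
    {"source": "site_a", "country": "US"},
    {"source": "site_b", "country": "US"},
    {"source": "site_b", "country": "BR"},
    {"source": "site_b", "country": "RU"},
    {"source": "site_b", "country": "IN"},
]
blocked = flag_suspicious_sources(sample)
print(blocked)
```

Run hourly, a list like this could feed a do-not-bid filter. A learned model would replace the fixed threshold, but the retrain-and-publish loop would look much the same.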
So, the ability to build models faster with automation for production use at a low cost is important. Let’s look at what key features the next generation of machine learning platform should support.
Data preparation is a big effort, and many tools and platforms may be used to process, clean, augment and wrangle the data. This is a big topic by itself. Right now, 60-80% of the time is spent on data preparation. For the sake of this blog, we are assuming that the data has already been prepared for building models (okay, I will come back and do a post on data wrangling later, hopefully).
There are a couple of things ML platforms can provide for additional insight, such as the “goodness of data” for a good model fit. A data set with mostly constant data is not good: even if it is complete, it will be hard to build a good model from it. Simple statistical properties like outlier bounds, skew, variance, correlation and histograms can be easily computed. However, platforms should go a step further and provide a “data quality scorecard” at the feature level and overall. But what does a score of 86 mean? Is it good or bad? That’s where additional insights and recommendations can help. The platform can show the score “compared to” other similar data sets or to a configured, well-known feature. The system can be trained to provide that score, or even better, a model can be built to generate the quality scorecard.
When one is dealing with hundreds of features, it is quite hard to review data properties by hand, so a recommendation or hint can go a long way toward understanding the data and making sure highly correlated or dependent variables are excluded from the model. (Note: highly correlated variables will be removed in the feature reduction process.)
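To make the per-feature checks above concrete, here is a minimal sketch using only the Python standard library. The function names, the 95% "mostly constant" cutoff, the 1.5 x IQR outlier rule and the 0.9 correlation threshold are illustrative assumptions, not a description of how any particular platform computes its scorecard.

```python
import statistics


def feature_report(values, constant_threshold=0.95):
    """Summarize one numeric feature; the thresholds are illustrative."""
    n = len(values)
    most_common_share = max(values.count(v) for v in set(values)) / n
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "variance": statistics.pvariance(values),
        "mostly_constant": most_common_share >= constant_threshold,
        "outlier_bounds": (lower, upper),
        "outlier_count": len(outliers),
    }


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either feature is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation >= threshold."""
    names = sorted(features)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) >= threshold]


# Toy data: one useful feature, one redundant copy, one constant column.
data = {
    "clicks": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "clicks_x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "flag": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
reports = {name: feature_report(vals) for name, vals in data.items()}
pairs = correlated_pairs(data)
```

Here the constant column is reported as mostly constant and the redundant pair is surfaced as a hint. Turning such properties into a single score, and comparing it against similar data sets, is the harder part that a platform would need to add.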
Ease of Use in building Models
The ability to build models for common problems easily is important to platform adoption and broader support. The platform should provide solution templates so that one can get started easily. If I am building a customer churn prediction, it is not hard to build a workflow that guides the user in easy steps. Can past models built for the same use case guide the user in feature engineering?
A wide array of algorithms, such as GLM, GBM, Deep Learning and Random Forest, is now available in most platforms. Platforms that support in-memory computation can build models faster and at a lower cost. This is important because newer use cases are real-time and need a model to be rebuilt frequently (every hour, say). Start with simple algorithms such as GLM and GBM; they are easier to understand and tune. Whenever a data scientist on the team proposes solving a problem with complex algorithms, ask them to pause and see how to get started with a simple algorithm first, then iterate. The iteration is more important than finding the exact algorithm.
Productizing the models
Once models are built, it is critical that they be enabled for production quickly. There is no better test than running in production with a small percentage of online traffic and getting some results. The quicker it can be done, the better. The platform should support experimental logging of scores. This way you can get your model's scores on production traffic without impacting the production application. Data scientists badly need this functionality, because it lets them experiment with models quickly.
In the past, models were built, converted to code and pushed to the production system, a process that took weeks. The new generation of SaaS-based ML platforms has integrated model building and scoring into the platform, so that models can be enabled for scoring with the click of a button and can be scaled easily. PMML adds portability to the model, although it never performs as well as a model built and scored in the same platform, where scoring can be optimized. So PMML gives flexibility, but sometimes at the expense of optimization, a normal tradeoff that we encounter in other technology stacks as well.
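One way to picture experimental logging of scores is the "shadow" pattern sketched below: the production model answers the request, while a candidate model scores the same input and its result is only logged. The function name, the log format and the stand-in models are hypothetical; a real platform would handle this inside its scoring service.

```python
import json
import logging

logger = logging.getLogger("shadow_scores")


def score_request(features, production_model, candidate_model, log=logger):
    """Serve the production score; log the candidate score on the side."""
    served = production_model(features)
    try:
        shadow = candidate_model(features)
        log.info(json.dumps({"features": features,
                             "served": served,
                             "shadow": shadow}))
    except Exception:
        # A failing candidate must never affect the production response.
        log.exception("candidate model failed")
    return served


# Hypothetical stand-in models, for illustration only.
def production_model(f):
    return 0.25 * f["x"]


def candidate_model(f):
    return 0.5 * f["x"]


def broken_model(f):
    raise ValueError("bad model")


result = score_request({"x": 5}, production_model, candidate_model)
result_broken = score_request({"x": 5}, production_model, broken_model)
```

The caller always receives the production score, even when the candidate raises an error, so the logged pairs can be compared offline before any traffic is actually switched over.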
Quick iteration is the best way to know the efficacy of the model and make tuning adjustments.
Visibility & Collaboration
Data science is still a black box for many inside the company. Which models have been built, which models are being used for A/B testing in a certain application, and so on, are hard to find out. If you ask a few months later what training data was used to build a model, the answer is not easy to get. Many tech-savvy companies have built proprietary tools to manage this. Data scientists now use a wide array of tools such as R, Python, H2O and Spark MLlib, among many others. Integration with other platforms and tools is important in providing visibility to peers and fostering collaboration. Part of that is organizing how models are built across this wide array of tools and sharing what was learned.
A platform that makes it easy to organize and tag models, allows collaboration and keeps track of changes will help speed innovation. The more open it is, the better the chance of success.
There will be complex problems that require lots of analysis, feature engineering and sophisticated algorithms, but there are classes of problems for which simple algorithms and solutions will be good enough. The platform should be easy enough for product managers and business analysts to build models. They should also be able to leverage model configurations from their data scientists to play around with new data. It should be easy to compare the scores of multiple models to see how a new model stacks up. Providing model templates for common sets of problems and verticals can help new users leverage the platform better.
Data drift: Adaptive to data change & Model Versioning
In most organizations, retraining of models is scheduled once every six weeks or once a quarter. These days, data is changing at a much faster rate, and it is important to leverage it sooner. So the platform needs to surface important changes in data characteristics, both overall and at the feature level. Such changes often point to data pipeline issues, which need to be addressed quickly because they affect model performance.
A tool for comparing two models, including their configuration and feature differences, would be valuable. It would also be a good analysis tool for understanding how data is changing over time and what impact that has.
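As one possible feature-level drift signal, the sketch below computes the Population Stability Index (PSI) between the values a feature had at training time and the values seen recently. PSI is my choice for illustration, not something the platform discussion above prescribes. The bin count and the small epsilon used for empty bins are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Replace empty bins with eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Toy data: the "recent" sample is the training sample shifted by 50.
training = [float(v) for v in range(100)]
recent = [v + 50.0 for v in training]

psi_same = psi(training, training)
psi_shifted = psi(training, recent)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, though the right threshold depends on the feature and the use case. Tracking a number like this for each feature between retraining runs is one way to notice pipeline problems before model performance degrades.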
Note that tech-savvy companies will have lots of tools and a big team of data scientists, and they will build custom tools; we are not talking about them. We are talking about the many companies that cannot afford a big data science team and are not in the technology area. They need tools that are simple to use and can help speed the adoption of machine learning in their business. Cloud-based SaaS platforms are the best way to get started at a lower cost.
At ScoreData, we have built ScoreFast™, a cloud-based platform geared toward such businesses: simple to use, at a lower cost. Once a model is built, it can be enabled for scoring with a single click. The model is optimized for speed. Models can be shared among peers, so that they can see which features are being used and leverage the configuration to build models using their own data. ScoreFast™ also provides a data quality scorecard for each feature and for the overall data set, with recommendations to the modeler.
The next generation of ML platforms will make machine learning more transparent, more collaborative and easier to use, at a lower cost.