Blog May 15, 2018

Assessing Google's ML Kit For iOS and Android

On May 9th, Google announced ML Kit for Firebase. Aimed at ordinary developers rather than data scientists, this release brings machine learning functionality directly to both iOS and Android devices.

Six features headline the release:

  • Face detection
  • Barcode recognition
  • Image labeling
  • Landmark recognition
  • Text recognition
  • Custom Models

Like Apple's Core ML, Google engineered ML Kit to run machine learning models on a device, not to build those models. So ML Kit will not make your phone smarter, but it certainly might make your app appear more responsive to your users.

When linked with Google's Firebase backend, even more computationally intensive models can deliver impressive accuracy. Not all features of ML Kit can run both in the cloud and on device, though. At launch, text recognition and image labeling work in both places, landmark recognition is cloud-only, and face detection, barcode scanning, and custom models run on device.

ML Kit is important for two reasons. First, at the typical CapTech client, knowledge of mobile development far outstrips knowledge of machine learning. Almost every mobile developer has played around with Firebase; few have waded into the thicket of convolutional neural networks, which really are as convoluted as the name suggests.

Second, by bundling prebuilt models for common tasks like barcode and text recognition, Google is making it easy to generate value with machine learning. Five years from now there will be much more machine learning maturity in the enterprise. In the meantime, we can explore the uses of machine learning while still keeping the agile train rolling. At the very least, developers will be less dependent upon the sometimes dodgy third-party frameworks needed to complete basic tasks like barcode reading.

Think of Device as API

Here is what the world of mobile looked like just a few years ago.

The user would communicate with the device in one of two ways: voice or finger gestures. Sometimes those finger gestures would include filling out long forms using a ridiculously small keyboard. Think of the mobile keyboard as a RESTful API for receiving POST requests, except that it is an API that can only accept one character at a time! It's annoying, inefficient, and slow, and we are still doing it all the time. Things might have been faster with a Palm Pilot, assuming you could learn its Graffiti alphabet.

Now, with digital assistants like Siri and Google Assistant, voice commands can fill in text fields and forms much more efficiently. But the accuracy just is not there yet. Data entry for the enterprise often requires jargon, proper nouns, and mixes of letters and numbers. For business-critical data entry, users know that they use their voice at their own risk. Voice is also inherently public: nobody wants to read out their credit card number in a coffee shop.

In many cases, the user can input large chunks of data quickly and accurately with a barcode.

Both Apple and Google have supported barcode entry for some time, but the complexity has been too much for many development teams. That leads back to the problem of depending on third-party frameworks like ZXing.

This is what makes ML Kit so intriguing. Firebase is well-respected and well-supported, and developers are already familiar with it. What's more, barcode scanning happens entirely on device; nothing gets transmitted to Google.
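
To get a feel for how little code this takes, here is a minimal sketch of on-device barcode scanning with the FirebaseMLVision pod, based on the launch-era API; the function name and error handling are mine, not Google's sample code.

import FirebaseMLVision
import UIKit

// Sketch: scan a still image for barcodes entirely on device and log the payloads.
func scanBarcodes(in image: UIImage) {
    let barcodeDetector = Vision.vision().barcodeDetector()
    let visionImage = VisionImage(image: image)

    barcodeDetector.detect(in: visionImage) { barcodes, error in
        guard error == nil, let barcodes = barcodes, !barcodes.isEmpty else {
            print("Scan failed: \(error?.localizedDescription ?? "no barcodes found")")
            return
        }
        for barcode in barcodes {
            // rawValue carries the encoded payload, e.g. a SKU, ticket number, or URL.
            print(barcode.rawValue ?? "")
        }
    }
}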

World to API: "Hello!"

With machine learning, the basic mobile picture changes again, to something more like this:

Once again, think of the mobile device as a RESTful server. With augmented reality and machine learning, the world is posting itself to the device through the lens of the device's camera. Perhaps with help from the cloud, perhaps without it, the device returns an augmented version of the world to the user. "Hello there, user, I'm the world."

Take ML Kit's text recognition, for example. The promise of machine learning is that this task can be accomplished more quickly and accurately than ever before. As I argued in my blog last year, there is no magic here. The results of machine learning are probabilistic, not mechanistic. Developers find machine learning hard because we don't tend to think in probabilistic terms. We want the correct answer every time, not 98% of the time. Those old-school, overpriced OCR tools we used to use were written by deterministic programmers using deterministic code. No estimate of confidence required.

Even more importantly, pixel-clustering OCR tools have no potential to improve over time without rewriting the code. With ML, the model can evolve with new data and thus drive the accuracy percentage closer to 100%.

Success Metric: Speed + Accuracy = Trust

These are very, very early days for ML Kit. Over the next few weeks, CapTech's devs will be digging into the details of Firebase ML Kit. As I survey our portfolio of enterprise mobile apps, I see major financial institutions, resorts, large retail chains, and government agencies, among others. Each one of these has an incentive to make it easier for the user to communicate something complex. And reciprocally, each has an incentive to make their world comprehensible to the user.

Our clients are repositories of trust for their users. Accuracy and speed build trust; sloppiness degrades it quickly. So machine learning offers an opportunity to build that trust.

In the remainder of this blog, I'll explore the sample iOS apps Google provides for ML Kit. Hopefully, I'll spare you some of the headaches I encountered along the way.

When integrating ML into a mobile app, measure the value added against the following questions:

  • How intrusive is the required third-party framework? Does it require a change to build and deployment?
  • How large is the framework? To what degree will it bloat my app bundle?
  • How do I measure the expected cost of a cloud-based machine learning framework?
  • Does the API insulate developers from machine learning code?
  • In addition to returning results, does it return a confidence level in those results?

This last question is most important. As I mentioned above, machine learning models produce probabilistic outcomes, not deterministic ones. If the ML API doesn't return a confidence level, the app has no means to manage the tradeoff between false positives and false negatives.
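
To make that concrete, here is a hypothetical Swift sketch of the kind of gate an app could build if a confidence score were available. The thresholds and the three-way decision are illustrative business rules, not anything ML Kit provides; tightening the accept threshold trades false positives for more manual confirmation, and loosening it does the reverse.

// Hypothetical confidence gate: the thresholds are business decisions, not ML Kit values.
enum RecognitionDecision {
    case accept   // high confidence: use the result as-is
    case confirm  // middling confidence: ask the user to verify
    case reject   // low confidence: fall back to manual entry
}

func decision(forConfidence confidence: Double) -> RecognitionDecision {
    if confidence >= 0.95 { return .accept }
    if confidence >= 0.70 { return .confirm }
    return .reject
}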

First Look: Text Recognition

Setting up a basic Firebase account is painless. No money out-of-pocket is required. This should make it easy to explore ML Kit functionality. Start by setting up a new project at Firebase. For ML features that require the Cloud Vision API, the Firebase project needs to be on the pay-as-you-go Blaze plan. Download the sample app here. Be sure to update the GoogleService-Info.plist file in the sample app with the one from your Firebase project.

Firebase still requires CocoaPods rather than Carthage or Swift Package Manager. If your CocoaPods install is outdated, as mine was, you'll probably need to upgrade it. Then make sure to run pod install with the --repo-update option, like this:

pod install --repo-update

For text recognition, the burden of embedded frameworks is not bad. Here's what I found when I cracked open the package.

So Google provides full text recognition capabilities for under 1.5MB. Not bad. When you run the sample app, you'll see:

The sample app is clever. It is designed to recognize images from the Assets.xcassets directory, so you can throw as many images as you like into the bundle and test on the simulator. The two options at the bottom allow you to test the recognition either on the device or using the Cloud Vision API.

Unfortunately, even after signing up for the Blaze plan and copying over the GoogleService-Info.plist, I wasn't able to authenticate with the Firebase backend, so I wasn't able to test out the Cloud Vision API. I did get acceptable results from the device-only recognition engine. Here you can see how it did against a parking pass for a Richmond Flying Squirrels baseball game. Not bad.

Now that looks cool, but you'll note that the engine missed the Flying Squirrels logo in the upper-left corner. To activate the engine, the app makes the following call:

func runTextRecognition(with image: UIImage) {
    let visionImage = VisionImage(image: image)
    textDetector.detect(in: visionImage) { features, error in
        self.processResult(from: features, error: error)
    }
}

The function textDetector.detect(in: FIRVisionImage, completion: FIRVisionTextDetectionCallback) does not give the developer an option to specify a confidence value. It is simply image -> Text. Developers and product owners may not understand neural networks, but they do understand confidence levels. It is disappointing that the ML Kit framework excludes this from its return values.
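
For context, here is a sketch of what the companion processResult(from:error:) might look like against the launch-era FirebaseMLVision API. The real sample app does more UI work, but the shape of the result is the point: each VisionText feature exposes a string and a frame, and no confidence score.

import FirebaseMLVision
import UIKit

final class TextRecognizer {
    // The on-device detector comes from the shared Vision instance.
    lazy var textDetector = Vision.vision().textDetector()

    // A possible shape for the processResult(from:error:) callback used above.
    // Note there is no confidence value anywhere in the returned features.
    func processResult(from features: [VisionText]?, error: Error?) {
        guard error == nil, let features = features, !features.isEmpty else {
            print("No text found: \(error?.localizedDescription ?? "empty result")")
            return
        }
        for feature in features {
            print("\(feature.text) at \(feature.frame)")
        }
    }
}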

Deeper Dive: Custom Learning Model

One of the attractions of Firebase ML Kit is that custom learning models are available. To give this a try, I followed the instructions here. As suggested, I set up a Python virtual environment, then pulled in the TensorFlow library using the pip dependency manager.

The Towards Data Science blog by Sagar Sharma offers an excellent introduction to getting started with custom machine learning models. I'm not going to repeat all the steps here. Instead, I'll focus on the high points.

I wanted to see how hard it would be to train a model to recognize individual faces. So I gathered up a handful of CapTechers and took twenty or so pictures of each person. Here's a sample:

Each of these I fed into the model. TensorFlow made it easy to train the model. All I had to do was:

  • Resize each picture to 299 x 299.
  • Group the pictures by person: pictures of each person go into their own directory.
  • Run the following retraining script.
python scripts/retrain.py --output_graph=tf_files/retrained_graph.pb \
  --output_labels=tf_files/retrained_labels.txt \
  --image_dir=tf_files/people

It took about two hours to train the model on my hulking 12-core 2012 Mac Pro. The model generation process output two files: retrained_graph.pb and retrained_labels.txt. From here, I was able to test the model against new pictures using a script like this:

python scripts/label_image.py --image /Users/mbroski/Desktop/CT/IMG_0837.jpg

And this produced the following delightful output:

Evaluation time (1-image): 0.890s
mbroski (score=0.90557)
apazylbekov (score=0.04920)
cteegarden (score=0.01493)
thughes (score=0.01230)
mluansing (score=0.00900)

Notice the confidence numbers. This is what I was looking for! From here, getting the model running on device is simply a matter of packaging. This command produces the optimized graph that would be embedded in the app:

python -m tensorflow.python.tools.optimize_for_inference \
  --input=tf_files/retrained_graph.pb \
  --output=tf_files/optimized_graph.pb \
  --input_names="input" \
  --output_names="final_result"

Unfortunately, I wasn't able to get the model built to the point of testing it on the device. However, the Firebase docs indicate that once the model is running, I will be able to assess the inference probabilities with a call like this:

// Get first and only output of inference with a batch size of 1
let probabilities = try? outputs.output(index: 0)

That last result is the most encouraging. The custom model outputs enough to apply business logic to the inference.
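
As a sketch of what that business logic could look like, assume the probabilities can be flattened into a [Float] array ordered to match the lines of retrained_labels.txt. The helper below, its threshold, and the fallback behavior are illustrative, not part of the ML Kit API.

// Sketch: act on the top label only if the model is confident enough.
func bestMatch(probabilities: [Float], labels: [String], threshold: Float = 0.9) -> String? {
    guard probabilities.count == labels.count,
          let best = probabilities.enumerated().max(by: { $0.element < $1.element }),
          best.element >= threshold else {
        return nil // below the confidence bar: fall back to manual handling
    }
    return labels[best.offset]
}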

Conclusion

It is easy to imagine scenarios where machine learning on mobile makes sense. Done well, the world becomes an input device for the user. Barcodes and text recognition can replace long, error-prone, and frustrating text entry forms.

At the same time, ML is the foundation behind augmented reality. Both Apple and Google will continue to provide out of the box solutions for the most common use cases. But the tradeoffs with simpler solutions are severe. I wouldn't recommend machine learning output that does not include confidence levels.

There's no escaping the reality that developers are going to have to learn more about how machine learning models really operate. This is similar to developers learning the principles of User Experience and Design. It hasn't been easy, but we've made progress here. The same can happen with machine learning.

In just a couple of weeks at WWDC, Apple will surely announce improvements to its machine learning framework, Core ML. Right now, it is possible, though difficult, to integrate a Google TensorFlow model into Apple's Core ML, as this post suggests.

We need machine learning models based on industry standards that can operate on both iOS and Android. I can visualize data scientists working alongside developers and designers, building out the next generation of mobile apps. Seamless collaboration with designers has made things better all around. The same kind of collaboration is now needed with ML modelers.

For our clients, the only question now is, "Who will get there first?"