New for iOS 10

In previous versions of iOS, speech recognition was not an easy feature to add to an app. Now in iOS 10, Apple has made the technology behind Siri available to developers with the SFSpeechRecognizer API. It provides accurate transcriptions from both live and previously recorded audio, and the results also include alternate interpretations, timing, confidence levels, and more.

The API is not difficult to add to your app. The demo app is very small; the recognizer code lives in a single controller file and can easily be adapted for your own project. There are four steps to adding SFSpeechRecognizer to an app.

Get User Authorization

First, using SFSpeechRecognizer requires permission from the user. To request permission, simply update your app's Info.plist file to include the NSSpeechRecognitionUsageDescription key along with your usage description. When users first try to use recognition they will be presented with an alert containing your usage description.

<key>NSSpeechRecognitionUsageDescription</key>
<string>Describe how your app will use transcribed data here</string>
Speech Recognition permission alert

This demo app can transcribe live audio from the user with the device microphone. Using the microphone also requires user authorization, so add the NSMicrophoneUsageDescription key and a description to Info.plist. If your app only uses audio files, microphone authorization is not required.

<key>NSMicrophoneUsageDescription</key>
<string>Describe how your app will use the microphone here</string>
Microphone permission alert

Request Speech Recognizer Authorization

The second step is requesting authorization with SFSpeechRecognizer.requestAuthorization. The first time this is called, an alert with the description from the Info.plist is displayed. requestAuthorization should be called every time SFSpeechRecognizer is used; the alert will only be displayed once, but subsequent calls determine whether an internet connection is available. For most devices a connection is required because the audio is sent to Apple's servers for recognition.

SFSpeechRecognizer.requestAuthorization { authStatus in
    // The callback may not be called on the main thread. Add an
    // operation to the main queue to update the record button's state.
    OperationQueue.main.addOperation {
        var alertTitle = ""
        var alertMsg = ""

        switch authStatus {
        case .authorized:
            do {
                try self.startRecording()
            } catch {
                alertTitle = "Recorder Error"
                alertMsg = "There was a problem starting the speech recorder"
            }
        case .denied:
            alertTitle = "Speech recognizer not allowed"
            alertMsg = "You can enable the recognizer in Settings"
        case .restricted, .notDetermined:
            alertTitle = "Could not start the speech recognizer"
            alertMsg = "Check your internet connection and try again"
        }

        if alertTitle != "" {
            let alert = UIAlertController(title: alertTitle, message: alertMsg, preferredStyle: .alert)
            alert.addAction(UIAlertAction(title: "OK", style: .cancel, handler: { (action) in
                self.dismiss(animated: true, completion: nil)
            }))
            self.present(alert, animated: true, completion: nil)
        }
    }
}

Create A Recognition Request

The third step is to create a recognition request. This request contacts Apple's servers to transcribe your audio. There are two types of requests: SFSpeechURLRecognitionRequest, used for prerecorded audio files, and SFSpeechAudioBufferRecognitionRequest, used for live audio. For both requests you will need an SFSpeechRecognizer to perform the task. SFSpeechRecognizer supports over 50 locales and languages; if a locale is not supported, the initializer returns nil.

private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
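
Since the initializer returns nil for unsupported locales, and the recognizer can become temporarily unavailable (for example, without a network connection), it can be worth guarding both before starting a request. A minimal sketch, assuming the speechRecognizer property above:

guard let recognizer = speechRecognizer else {
    // The locale passed to SFSpeechRecognizer(locale:) is not supported
    print("Speech recognition is not supported for this locale")
    return
}

guard recognizer.isAvailable else {
    // The recognizer exists but cannot be used right now (often no connection)
    print("Speech recognition is not currently available")
    return
}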

Create Request for Pre-recorded Audio

Transcribing a pre-recorded audio file is extremely easy to implement and only requires a few lines of code. First get the URL of your file, then create the SFSpeechURLRecognitionRequest.

let fileURL = URL(fileURLWithPath: Bundle.main.path(forResource: "audio", ofType: "mp3")!)
let request = SFSpeechURLRecognitionRequest(url: fileURL)

Create Request for Live Audio

Using the recognizer with live audio takes a little more work but is still not very difficult. Use AVAudioEngine to capture audio from the device microphone, along with an SFSpeechAudioBufferRecognitionRequest.

private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private let audioEngine = AVAudioEngine()

When the app is ready to record audio, call prepare() and then start() on the audioEngine.

audioEngine.prepare()
try? audioEngine.start()

The audioEngine will throw an error if start is called while it is already running. Use isRunning to determine the status of the audioEngine. As the audioEngine returns data, append it to the recognitionRequest.

guard let inputNode = audioEngine.inputNode else { fatalError("There was a problem with the audio engine") }
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}
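
Since start() throws if the engine is already running, the call can be wrapped in an isRunning check. A minimal sketch, assuming the audioEngine property above:

if !audioEngine.isRunning {
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        print("There was a problem starting the audio engine: \(error)")
    }
}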

Transcribe Audio with Recognition Task

With the recognitionRequest created and populated with audio, it's time for the final step: transcribing the audio using an SFSpeechRecognitionTask.

private var recognitionTask: SFSpeechRecognitionTask?

At this point it makes no difference whether the recognitionRequest audio came from a recording or a live source; the same method is used to get the transcription. Call SFSpeechRecognizer's recognitionTask(with:resultHandler:) to get an SFSpeechRecognitionResult that contains the bestTranscription along with alternate transcriptions.

let recognitionTask: SFSpeechRecognitionTask = recognizer.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
    if let error = error {
        print("There was a problem: \(error)")
    } else if let result = result {
        print(result.bestTranscription.formattedString)
    }
})

The bestTranscription result is an SFTranscription. It contains the formattedString along with an array of segments; each SFTranscriptionSegment includes a substring, substringRange, timestamp, duration, and confidence. Along with the bestTranscription, the SFSpeechRecognitionResult also includes an array of alternate transcriptions that contain alternate versions of what the recognitionTask thought it heard. In my testing the data points for SFTranscription are not always populated; for some time after starting my app they all return 0.0. Eventually the data is populated, but I have not determined what causes it to begin working.
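
For example, each segment of the best transcription can be inspected individually. A minimal sketch, assuming speechResult is the SFSpeechRecognitionResult returned to the result handler:

for segment in speechResult.bestTranscription.segments {
    // Each SFTranscriptionSegment covers one recognized word or phrase
    print("\"\(segment.substring)\" at \(segment.timestamp)s for \(segment.duration)s, confidence \(segment.confidence)")
}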

Use Your Results

What you do with the transcribed results of speech recognition is only limited by your ideas and how accurately you can find individual words and phrases within the results to use as commands or triggers.

In this demo app I perform two basic tasks with the transcribed results. First, the app simply displays the bestTranscription results and allows users to add them as notes or reminders. Second, it looks for a short list of phrases like "Add note" or "Reminder." If a phrase is detected, the command phrase is removed from the results and the remaining text is automatically added to the note or reminder list. Because the results are strings, it is not very difficult to search them for specific phrases.

To detect phrases there has to be an end point for a recognitionRequest. Part of the SFSpeechRecognitionResult returned is a Boolean value called isFinal that will be true when the task has finished transcribing. In my testing I never got an isFinal result that was true. My assumption was that stopping the audioEngine would signal the end of a recording and the last result returned from the recognitionRequest would have isFinal set to true. Unfortunately that does not seem to happen. Because I could not know when a user is finished recording, I have to stop them after a set period of time. Adding a Timer with a short delay (five seconds in this snippet) solves this problem. In viewDidLoad the timer is created and added to the current RunLoop.

let timer = Timer(timeInterval: 5.0, target: self, selector: #selector(RecordingViewController.timerEnded), userInfo: nil, repeats: false)
RunLoop.current.add(timer, forMode: .commonModes)
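
The timerEnded method referenced by the selector is where recording actually stops. A minimal sketch of what it might look like, assuming the audioEngine and recognitionRequest properties above:

func timerEnded() {
    // Stop capturing audio and let the request know that no more audio is coming
    if audioEngine.isRunning {
        audioEngine.stop()
        recognitionRequest?.endAudio()
    }
}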

Begin with the phrases to search for in the results to use as commands. For this app I search for either "note " or "remind" because a user would typically begin a phrase with something like "add a note to", "add reminder to", or "remind me to." As the bestTranscription comes back with segments, search those segments for the phrases.

var addNote = false
var addReminder = false

for segment in speechResult.bestTranscription.segments {
    if segment.substringRange.location >= 5 {
        let best = speechResult.bestTranscription.formattedString
        let indexTo = best.index(best.startIndex, offsetBy: segment.substringRange.location)
        let substring = best.substring(to: indexTo)

        addNote = substring.lowercased().contains("note ")
        addReminder = substring.lowercased().contains("remind")
    }
}

If one of those phrases is found, the app removes the entire command phrase from the beginning of the transcription.

let noteCommands:[String] = [
 "note to ",
 "note "
]

let remindCommands:[String] = [
 "remind me to ",
 "remind me ",
 "remind ",
 "reminder to ",
 "reminder "
]

recordedTextLabel.text = remove(commands: noteCommands, from: recordedTextLabel.text ?? "")

func remove(commands: [String], from recordedText: String) -> String {
    var tempText = recordedText

    // Search the array of command strings and remove any that are found
    for command in commands {
        if let commandRange = tempText.lowercased().range(of: command) {
            // Find the range from the start of the recorded text to the end of the found command
            let range = Range.init(uncheckedBounds: (lower: tempText.startIndex, upper: commandRange.upperBound))

            // Remove the found range
            tempText.removeSubrange(range)
            print("Updated text: \(tempText)")
        }
    }

    return tempText
}

After a command phrase has been found and removed from the transcription, the remaining string is added as a note or reminder, depending on the command found.
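
Tying it together, the addNote and addReminder flags from the segment search can decide which command list to strip and where the cleaned-up text goes. A minimal sketch, assuming hypothetical notes and reminders arrays backing the two lists:

let recordedText = recordedTextLabel.text ?? ""

if addNote {
    // notes is a hypothetical [String] backing the note list
    notes.append(remove(commands: noteCommands, from: recordedText))
} else if addReminder {
    // reminders is a hypothetical [String] backing the reminder list
    reminders.append(remove(commands: remindCommands, from: recordedText))
}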

Conclusion

And that is how to add and make use of Apple's new SFSpeechRecognizer API. As expected, the recognition is not perfect, but as more developers and users make use of the API it will only become more accurate. Given how easy the API is to add, it should be adopted quickly by the app development community, and with that will come new code and ideas that build on it. I'm very interested to see what comes from the community in the next year. You can find the complete demo and see what I've done here. The code can be a good starting point for your work.

About the Author

Josh Huerkamp

Josh Huerkamp is a Senior Consultant in Captech's Technology Solutions practice area and is based in Charlotte, NC. He has over 10 years of experience and specializes in Java and iOS mobile development.