The summer after my freshman year I was desperate to work on a project that would make my life more efficient as I worked around my room. I work on different computers at home, and I wanted a way to easily communicate what I want done or add tasks quickly, so I tried to design my own Jarvis-like system. It started off as a very complex task - I came up with an initial version in C++ that would take an audio file, parse it, understand the information, and then do that task. That didn't work out too well. I was a little too optimistic, but I was also thinking of the solution in a very complex way: I wanted to take the raw speech and parse it word-for-word rather than match it to the most similar known phrase.
There is a speech recognition library called CMU Sphinx. It lets you feed audio in through the microphone and parses that input into a result. The catch is that you must define the set of possible phrases in advance - say "turn on the light" or "turn on that computer" - and it will respond to those signals. You pre-define every phrase you could possibly say, and the library compares the incoming audio to those phrases and returns the closest match. You can then run commands based on whichever phrase it gives back. This made my life a little simpler, since I could train it on different commands and take a back seat to the complexity.
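To give a sense of what "pre-defining the phrases" looks like, here is a minimal sketch of a grammar file in the JSGF format that Sphinx accepts. The file name and the exact phrase list are just placeholders for whatever commands you want it to recognize:

```
#JSGF V1.0;

grammar commands;

public <command> = turn on the light      |
                   turn on that computer  |
                   how is the weather ;
```

Sphinx listens to the microphone and hands back whichever of these phrases best matches what it heard.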
The library uses a very extensive machine-learning approach in which each word is broken down into its pronunciation (much like the pronunciations you see on Wikipedia). It is extremely simple to use, and there is good documentation for it. The next step was to feed the data from Java (the language the library is written in) into Node.JS so I could distribute the work to different machines depending on the input. For example, if I wanted the light on, it could interface with the Arduino. This part was a little simpler with Node.JS, so I ran with it. There are text-to-speech libraries for Node.JS, so I used those to send the output to the speakers; the voices they offer depend on the voices installed on your machine. Apple's collection of voices is incredible - they have speakers for all sorts of languages - and you can run the text through a translation library in Node.JS and have a different language's voice speak it. (This is a little funky since the translation is not 100% correct, but in my experience with Hindi it's pretty great.)
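As a rough sketch of that glue, one simple way to get the recognized phrases from the Java process into Node.JS is to pipe its stdout into a Node script and speak responses with a text-to-speech package such as say. The pipe setup, phrase strings, and package choice here are assumptions for illustration, not necessarily how my actual code is wired up:

```js
// listen.js - run as: java -jar recognizer.jar | node listen.js
// (assumes the Java/Sphinx process prints one recognized phrase per line)
const readline = require('readline');
const say = require('say'); // text-to-speech; uses the voices installed on your machine

const rl = readline.createInterface({ input: process.stdin });

rl.on('line', (phrase) => {
  // Dispatch based on the phrase Sphinx matched
  if (phrase === 'turn on the light') {
    // e.g. signal the Arduino here
    say.speak('Turning on the light');
  } else if (phrase === 'how is the weather') {
    say.speak('Let me check the weather');
  } else {
    say.speak('I did not catch that');
  }
});
```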
I made simple functions that would fetch the weather, so when the user asked "How is the weather" the system would speak the answer back. It's extremely simple and easy to distribute (I cached the results to files, since in my tests I ended up calling the API too much).
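Here is a minimal sketch of what one of those weather functions could look like, with responses cached to a file so repeated questions don't hammer the API. The API URL, response fields, and cache file name are placeholders, not the actual service I used:

```js
const https = require('https');
const fs = require('fs');
const say = require('say');

const CACHE_FILE = 'weather-cache.json';
const CACHE_TTL_MS = 10 * 60 * 1000; // only hit the API every 10 minutes

// Hypothetical endpoint and key - substitute whichever weather API you use
const API_URL = 'https://api.example.com/weather?city=Boston&key=YOUR_KEY';

function getWeather(callback) {
  // Serve from the cache file if it is fresh enough
  if (fs.existsSync(CACHE_FILE)) {
    const cached = JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'));
    if (Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
      return callback(null, cached.summary);
    }
  }

  https.get(API_URL, (res) => {
    let body = '';
    res.on('data', (chunk) => (body += chunk));
    res.on('end', () => {
      const data = JSON.parse(body);
      const summary = `It is ${data.temperature} degrees and ${data.description}`;
      // Write the result to disk so repeated questions reuse it
      fs.writeFileSync(CACHE_FILE, JSON.stringify({ fetchedAt: Date.now(), summary }));
      callback(null, summary);
    });
  }).on('error', (err) => callback(err));
}

getWeather((err, summary) => {
  if (err) return console.error(err);
  say.speak(summary);
});
```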
I know the logic isn't the best, and neither are the programming practices, but for the sake of speed I hacked this code out. It can do a lot more, and I interfaced with the Arduino and a bunch of other tools to make it enjoyable. I also ran the Node.JS code on two different machines, ran CMU Sphinx on a third, and used a balancer to spread the tasks across the two Node.JS machines. It was a great learning experience with a steep learning curve, but overall I don't think it was extremely difficult to get started. All my non-private/non-specific code is here.
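The balancer itself doesn't need to be fancy. Here is a minimal round-robin sketch of the idea; the worker addresses, port, and /command endpoint are made up and stand in for however your worker machines accept tasks:

```js
const http = require('http');

// Addresses of the two worker machines (assumed - substitute your own)
const workers = ['192.168.1.10', '192.168.1.11'];
let next = 0;

// Send a recognized phrase to the next worker, round-robin style
function dispatch(phrase) {
  const host = workers[next];
  next = (next + 1) % workers.length;

  const req = http.request(
    { host, port: 3000, path: '/command', method: 'POST',
      headers: { 'Content-Type': 'text/plain' } },
    (res) => console.log(`${host} responded with ${res.statusCode}`)
  );
  req.on('error', (err) => console.error(`Could not reach ${host}:`, err.message));
  req.end(phrase);
}

dispatch('turn on the light');
```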