
Mission Statement

Project alpha is an attempt to investigate what the alternatives are to using keyboards for text entry into computing devices. Given the inevitable paradigm shift on the horizon due to the VR/AR revolution, it might be time to take a look at the most used and most overlooked part of a computer, namely: the keyboard.

Part One: Chorded Keyboards

VR on the rise

VR Pioneer Aaron Brancotti

Virtual reality is by no means new; attempts have been made since the 80s. However, it was only when legendary programmer John Carmack of Quake fame got involved in the Oculus project that I believed it had any chance. Since then Oculus has been bought by Facebook, an event that cooled my interest in that particular enterprise. However, the avalanche had started.

What in particular made me believe in the coming of VR was an astute observation by a man well worthy of the title of genius: Carmack observed that latency is key to the experience. The computer architecture stack that we use today has evolved organically. Step by step, improvements have been added, each of them adding a little bit of latency. Each addition had such a small effect, and seemed to offer so much advantage over what came before, that the accumulated delay went largely unnoticed. However, for the illusion of a virtual reality, even the slightest delay may totally destroy the immersion. As such, the key thing for any VR/AR solution is to attack these delays; all other objectives are secondary.

If you don’t believe me, try using a window manager in the style of the 90’s, such as Fluxbox, and you’ll notice how snappy it feels. Every action is instantaneous. To get back to this while retaining the features that we’ve come to love and adore, a total overhaul of the entire stack is necessary, even calling common assumptions, such as relying on the kernel abstractions, into question. In order to cut down latencies, process isolation practices with roots in the mainframe systems of the 70’s might have to be thrown away. Security policies are cumbersome at best and outright counterproductive for gaming systems, so tighter integration with the kernel that circumvents them may be called for. Apart from that, even the hardware input stack needs to be reconsidered at a fundamental level.

Video streaming will probably be a key element of VR solutions, but here careful attention has to be paid to buffer sizes and the timing of I- and P-frames. No longer will there be room for programmers who load framework on framework, adding GB after GB of libraries onto a software solution out of incompetence and laziness. Only programmers with a scarcity mindset will thrive in this environment.

Dodging the Consumption Only trap

The worst case, short of a system deliberately designed for inefficiency, is navigating a virtual keyboard using two-dimensional input. This is nevertheless accepted as commonplace in 10-foot interfaces such as gaming systems and smart TVs. While this might be sufficient for finding the latest Justin Bieber or Lady Gaga song on YouTube, for those of us who still see computers as an incredible tool for innovation and creation, it remains terribly inefficient. Imagine typing a ‘z’ followed by an ‘o’: on a QWERTY grid this means going up two steps and to the right about eight steps, depending on layout. That is about 10 keypresses just to move between the keys, plus two more to actually select them. For those who have studied computer science, it might be interesting to compare this to a linear search. Research on user interfaces has shown that each finger has a potential bandwidth of ca. 4 bits per second, and information theory estimates the information content of English text at roughly 1 bit per character. Presenting this data set in a worse fashion could only be done by implementing a truly linear search instead of a quasi-linear one.
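To make the keypress counting concrete, here is a minimal sketch in C. It assumes an idealized on-screen keyboard of three left-aligned QWERTY rows, a four-way pad plus a select button, and a cursor that starts on the first key; the names `ROWS`, `locate` and `dpad_presses` are made up for this illustration. Real layouts are staggered or padded differently, so the exact numbers will vary.

```c
#include <stdlib.h>
#include <string.h>

/* Idealized on-screen keyboard: three left-aligned rows, no stagger.
   This layout is an assumption; real TV keyboards differ. */
static const char *ROWS[] = { "qwertyuiop", "asdfghjkl", "zxcvbnm" };
enum { ROW_COUNT = 3 };

/* Find the row and column of key c. Returns 0 on success, -1 if absent. */
static int locate(char c, int *row, int *col)
{
    for (int r = 0; r < ROW_COUNT; r++) {
        const char *hit = strchr(ROWS[r], c);
        if (c != '\0' && hit != NULL) {
            *row = r;
            *col = (int)(hit - ROWS[r]);
            return 0;
        }
    }
    return -1;
}

/* Button presses needed to type text with a four-way pad plus a select
   button, with the cursor starting on the first character of text.
   Returns -1 if text contains a key that is not on the grid. */
int dpad_presses(const char *text)
{
    int row, col, presses = 0;

    if (text[0] == '\0')
        return 0;
    if (locate(text[0], &row, &col) != 0)
        return -1;

    for (size_t i = 0; text[i] != '\0'; i++) {
        int r, c;
        if (locate(text[i], &r, &c) != 0)
            return -1;
        presses += abs(r - row) + abs(c - col); /* cursor moves */
        presses += 1;                           /* select press */
        row = r;
        col = c;
    }
    return presses;
}
```

On this grid `dpad_presses("zo")` comes out to 12: ten cursor moves and two select presses for two characters. A keyboard needs two keystrokes for the same input.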

So if we are to use our VR controllers to point at virtual, laggy representations of decades-old technology with only two pointers, we are worse off than people who type with only their index fingers. This would make VR environments not only impractical but borderline hostile to anyone attempting to create instead of mindlessly consume. Unless this is solved, your 1000 USD state-of-the-art gaming rig will only be usable as a means for Silicon Valley companies to manipulate you with psychological methods that exploit your dopamine system. The systems meant to liberate us become the chains that enslave us: a dystopian nightmare reminiscent of Huxley’s Brave New World.

B..B..But what about speech recognition?

float InvSqrt(float x){
        float xhalf = 0.5f * x;
        int i = *(int*)&x;            // store floating-point bits in integer
        i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
        x = *(float*)&i;              // convert new bits into float
        x = x*(1.5f - xhalf*x*x);     // One round of Newton's method
        return x;
    }

Try activating Siri and get her to write and compile the code above, and you’ll realize that there is a long way to go before speech recognition is ready for productive use.

— EDIT: Opinions vary on this one, see comments —

I intend to elaborate on the pros and cons of speech recognition as input. The main point is that when errors occur, there is no simple way to correct them. Since one of the main sources of errors is similarity between probable phrases, there will be plenty of them. Because of the high cost of error correction in terms of time, even an extremely low error rate would be extremely damaging to performance. One of our primary enemies here is context switching, and having to painstakingly correct a speech recognition error is an extremely good example of it. With a keyboard, on the other hand, you are probably correcting errors without even noticing.
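A back-of-the-envelope sketch of the correction-cost argument, in C. The function name `effective_wpm` and all numbers used with it are illustrative assumptions, not measurements: it simply charges every word its share of the expected correction time.

```c
/* Effective words per minute once the expected correction time is
   charged to every word. Hypothetical model for illustration only:
   raw_wpm     - input speed with no errors at all
   error_rate  - fraction of words that come out wrong (0.0 to 1.0)
   fix_seconds - time lost per correction, context switch included
   Returns 0 for a non-positive raw_wpm. */
double effective_wpm(double raw_wpm, double error_rate, double fix_seconds)
{
    if (raw_wpm <= 0.0)
        return 0.0;

    double seconds_per_word = 60.0 / raw_wpm;
    double expected_fix = error_rate * fix_seconds;
    return 60.0 / (seconds_per_word + expected_fix);
}
```

With assumed figures of 150 words per minute of dictation, a 2% word error rate and 10 seconds per correction, `effective_wpm(150.0, 0.02, 10.0)` gives about 100: a third of the raw speed is lost even though 98% of the words were right.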

What is there to be done?

If you can find it in your heart to forgive me for my long-winded rant, I’d like to introduce you to the work that I am doing. By no means do I claim to have all the solutions to the problems of user interfaces. What I do have is the aspiration to compile a list of existing avenues along with some of my thoughts on them. Stay tuned and you’ll get as comprehensive an analysis as I can manage of all the alternative input methods that I can find.

Starting with chorded keyboards.

My aim is to do this on an open source, collaborative basis. So while I am not actively trying to provoke you by writing glaring inaccuracies, I would not mind having any errors corrected.

Feedback / Discussion

lobste.rs is probably the best forum for computer science at the moment. Check it out for comments from people with minds far sharper than mine. None mentioned, none forgotten.

https://lobste.rs/s/jkeeh8/case_for_new_input_method


Join the Conversation

10 Comments

  1. Had to comment because this reminded me of a short-lived experiment. My idea developed on thinking about the “attention bandwidth” required for traditional interfaces (touchscreens) and how that could be improved to reduce traffic accidents.

    This was never fully realized, but I went down the rabbit hole of a gesture-based interface. I only got as far as Morse code by fingertip connections, but I imagined this whole gesture-based “skeuomorph” interface. For example, twisting your fingers in a “turning a knob” motion might turn the volume up and down, etc.

    Anyway maybe this will spark some ideas in someone.

    https://thehelpfulhacker.net/projects/morse-code-glove/

    1. Hi

      Thank you for reaching out! Your angle seems very interesting and I will investigate it further. I sincerely hope that you are willing to cooperate to some extent. My aim is to create a framework for user input, so that you don’t have to worry about the parts that aren’t interesting to you at the moment and could use someone else’s prediction engine and so forth.

      Also I am glad to find another musicpd fan! Hands down the best music player ever created. Musicpd brings me to an interesting point: I am a great fan of the server-client approach, the UNIX way of doing things. As such I intend to create a framework along these lines, allowing for portability and other nifty features such as remote control and so forth.

      Creating a modular server-client design gives another alphabet. I always aim for the highest level of abstraction. As such, couldn’t we think of the musicpd database as a dataset? With a well-designed protocol it could be integrated using the same mechanics as the text. That is, the prediction code ought to be able to treat any form of input, be it code, music metadata or plain old text. If this model is successfully tied to, say, a shell interface, we have a very potent user interface.

      As far as morse is concerned I consider it a cornerstone of input and will give it a chapter. Maybe you have some thoughts on morse code as such. Someone over at lobste.rs also mentioned it and I am hoping that some cool demo could be made from all of this.

      So make sure to subscribe via RSS if you are able to.

      Do you have a preferred way of following/subscribing to blogs?

      I really hope to hear more from you.

        1. Thank you!

          I’ll be in touch for sure!

          I have a Morse article coming up that I think you’ll find very interesting!

  2. Just like twisting your fingers in a motion as a shorthand gesture for “make the music louder” (which is nothing more than using context, and will increase the volume, and will increase the room temperature set point), you could use similar techniques for voice recognition.

    Either

    n^2: “small letter n” (break) “caret” (break) “digit two”

    or a math context based

    “math formula”
    n²: “n raised to the power of two”
    (or, shortened:)
    n²: “n to the second”

    “math formula” sets a math recognition mode, just like sets the HVAC recognition mode.

    See? If a recognition system “knew” the context, it could work with much denser input, eliminating the timing issues of artificial breaks or the insertion of a token separator.

    And there are no reasons why one couldn’t combine (gesture, you would want to know the actual temperature before changing it anyway, wouldn’t you) AND ( “two degrees lower” (speech input) OR (gesture) ).

    If you try to determine the condition of your partner, you also do it on audio and visuals and context and probably history.

    “Try activating Siri and get her to write and compile the code above”

    As a non-native English speaker I have one additional layer in front of that: I have to think up the English expressions for everything.

    But almost 20 years ago I stumbled over “Perligata”, and you could use it or something similar to dictate code into Siri or a similar device quite well. https://metacpan.org/pod/Lingua::Romana::Perligata

    If you introduce common expressions for absolute and relative indentations, python syntax should be rather easy, too.

    That said: I really would like to have a gesture/voice module as command shell for my house, I’d happily install mobile phone cameras and microphones in each upper corner of each room, if there is no cloud and no internet connection involved.

    And there will be no single silver bullet.

    If I wake up at night and it is too warm, I won’t turn on the lights in order for the gesture cameras to work, and I won’t clear my throat to be able to yell the order to Siri or a similar device loud and clear enough; I will sneak to the thermostat and gently tap twice. Or my government wakes up and will be very unpleasant 😉

    My dad would be restricted to gestures only; his voice is almost eliminated by Parkinson’s disease. Also, his gaze is becoming static: instead of staring at some device himself, he would have to stare with the remote in his left hand.

    1. Thank you for your comment! Having a clearly defined subset of a language helps out immensely in clearing out the ambiguities.

      I’ve been thinking about having a less rigid, floating implementation of this for input prediction. With several input predictors running, their outputs are summed and weighted according to how many hits/misses each has provided, so as to give an idea of where exactly the user is in the library of all possible books, so to speak. So if you start to write things like “for(i=0;i<20;i++)”, the code predictor would get positive feedback in the form of a higher weight in the summation of all algorithms.
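The weighted summation described above could be sketched roughly like this in C. Everything here is an assumption made for illustration: the names `Predictor`, `predictor_weight`, `combine` and `give_feedback`, the fixed candidate count, and the add-one smoothing used to turn hits/misses into a weight. It is not the framework's actual design.

```c
#include <stddef.h>

enum { N_CANDIDATES = 3 };  /* candidate completions scored per step */

/* Track record of one predictor (e.g. a code model or a prose model). */
typedef struct {
    unsigned hits;
    unsigned misses;
} Predictor;

/* Weight from the track record. Add-one smoothing (an assumption)
   makes an untested predictor start at 0.5. */
double predictor_weight(const Predictor *p)
{
    return (p->hits + 1.0) / (p->hits + p->misses + 2.0);
}

/* Sum every predictor's scores, scaled by its weight, and return the
   index of the candidate with the highest total. */
size_t combine(const Predictor *preds, double scores[][N_CANDIDATES],
               size_t n_preds)
{
    double total[N_CANDIDATES] = { 0 };

    for (size_t i = 0; i < n_preds; i++) {
        double w = predictor_weight(&preds[i]);
        for (size_t c = 0; c < N_CANDIDATES; c++)
            total[c] += w * scores[i][c];
    }

    size_t best = 0;
    for (size_t c = 1; c < N_CANDIDATES; c++)
        if (total[c] > total[best])
            best = c;
    return best;
}

/* Feedback step: a predictor scores a hit if its own top candidate is
   the one the user actually entered, and a miss otherwise. */
void give_feedback(Predictor *preds, double scores[][N_CANDIDATES],
                   size_t n_preds, size_t actual)
{
    for (size_t i = 0; i < n_preds; i++) {
        size_t top = 0;
        for (size_t c = 1; c < N_CANDIDATES; c++)
            if (scores[i][c] > scores[i][top])
                top = c;
        if (top == actual)
            preds[i].hits++;
        else
            preds[i].misses++;
    }
}
```

A predictor that keeps guessing right, such as a code model while the user is typing a `for` loop, gradually dominates the sum, while the others fade without ever being switched off entirely.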

      Really good point thank you so much!

      As far as privacy is concerned I am thinking of a setup working like this, with all tiers running basically the same code but at different levels of distance (latency) and publicity:
      - local device / cache of local data
      - private server / large-scale analysis of your own text
      - public server / shared data on the scale of hundreds of gigabytes at least

      Running data analysis on text requires a lot of computing power and storage. A lot of the data is common between humans; it would be quite hard to communicate if all individuals had vastly differing vocabularies.

      Providing a versatile framework would allow for this quite easily, as well as allowing a commercial entity to offer access to its CPU/storage capacity.

      The protocol is meant to be versatile and allow for any form of input. So I'd like to think of the bash shell as a language. To incorporate smart home functionality into this would be trivial I think!!

      As far as Perligata is concerned it seems to be an eye opener for me. I thought speech recognition was unusable for practical work. But with a good subset it makes sense; then it could be practical. Because a language like C has what, a few dozen keywords, and the algorithm knows when new variables can be added and can keep track of the rest. You really gave me a lot of food for thought!

      As far as your dad is concerned I think you'll be interested in my next chapter on Dasher input, which allows people to write quite efficiently with only a mouse, for example. Their slogan is basically: write anything with only one muscle. Theoretically it's extremely well thought through and rewarding. Apart from that I'll also be looking at the software that Stephen Hawking(!!) is using, which just so happens to be an open source solution that you can check out! So I find the human part of all of this really, really interesting. I really think this should get a lot of attention: opening up the world of communication might not be the same as getting full movement capabilities back, but it's a bit of solace at least.

      Be sure to find

  3. I think you’re wrong about speech recognition. I think it is very viable, in the near-term future (I’m personally working very hard in pursuit of this), and is a very powerful tool when harnessed correctly. I think casually dismissing it because Siri (which doesn’t have command or language models designed for code) can’t output code, is a needlessly discouraging perspective for people reading your blog who have hand or motor function disabilities and haven’t found a solution yet.

    I had no issues producing the code you posted character-for-character with voice in the first take without first practicing, matching all spacing and capitalization and correcting errors as I went (though I think my comment tab stops were slightly shorter): https://youtu.be/Kjab4fxkkXA

    1. As a matter of fact I have been following your work with great interest! It is an honor.

      Since writing this I have come to the conclusion that from an error point of view speech recognition is well suited for programming due to a very predictable syntax.

      Then again we have open-plan offices. I could not stand listening to 5-10 people reading code all day. Also, giving vim commands and the like is a whole different level. Languages have enormous amounts of redundancy built in, making them unsuitable for manipulating text or writing math for that matter.

      I am writing a chapter on this (early draft, feedback appreciated):
      https://github.com/TBF-RnD/alpha/blob/cleanup/books/1/Speech_Recognition.md

      and a blog post

      I would be very happy to discuss this further

      1. I’m wary of you trying to produce “the framework” for alternative input methods, when you aren’t listening to people *showing* you their alternative input methods. He just showed you that your example was… a bad example. Are you sure you understand the field well enough to be an authority?

        How much of this is based on theory vs. actual experience? Have you sat down with a voice programmer and watched them work yet? When are you intending on it?

        Vim is *easier* to control by voice than most other programs, *because* it has so many commands. Emacs too. Our brains are much better adapted for language than keychords.

        >What will happen is that after the error the user has to tell the computer to replace the word, by no means an efficient operation in terms of bits compared to for instance selecting with a mouse.

        This makes me think you haven’t tried many solutions. There are loads of options, but fundamentally correction is a lot easier than the initial input. You just say, “correct “, wait a few milliseconds, then speak a letter/number corresponding to your word. Here’s the key: it then gives you a *list of alternative hypotheses,* and you select one. Using the mouse isn’t that much more efficient – the problem is having to correct at all. You don’t have to make those kinds of corrections on a keyboard.

        >If the engine did not understand it the first time why would it do it the next time?

        Because you just told it what it got wrong, so a good engine can retrain itself. This is how the most popular offline speech recognition engine, Dragon, works already.

        Current engines aren’t flexible enough to do a lot of these things, but it’s really not that far off. As in, within a year the popular speech programming frameworks (do you know what they are?) will probably have this functionality built into their engines.

        1. I don’t think it is a bad example at all. Obviously you have decided that speech recognition is the future and there are no reasons for alternatives; sorry that you can’t be more open-minded.

          No I have not researched speech exhaustively yet. So I am very happy for your feedback. Read my comment again and you will find that I have reconsidered my views, am interested and impressed by his work and want to discuss this further.

          As far as my building “the” framework goes, I see it as “a framework” to test my ideas. Being open source, anybody would be able to fix errors.

          My criticism still stands: what “correct n” represents could be done with key input in a few bits.

          Spoken language has evolved to have a lot of redundancy; there is no way around that. That is an intrinsic disadvantage that speech recognition has.

          Surely a lot of the issues with speech will be resolved with time. However to get a good model you will need to approach the complexity of the brain.

          This is not meant to be a hit piece on speech recognition in any shape or form. Initially I wanted this to be a project focused on button input. I am sure speech recognition is good for many a thing, and in a future when AI approaches our own intelligence it will be a different beast.

          One other disadvantage with speech recognition is that it is unsuitable for places where several people work at the same time.

          Also, in terms of command bandwidth, imagine a high-performance StarCraft player: that is a different language, so to speak. Could you seriously see anyone speaking those commands that fast? Seriously?
