The Power of Large Language Models

Behind Empath’s Employee Skills Inference

Recently, there has been an explosion of interest in OpenAI’s ChatGPT. Its ability to provide meaningful answers to a wide range of questions has dazzled a broad range of users beyond the narrow universe of AI practitioners. ChatGPT is powered by GPT-3, the latest in a line of large language model (LLM) innovations. GPT-3 is trained on a huge corpus of books and Wikipedia entries, primarily using unsupervised learning (training where no labeled data of correct or incorrect results is provided). 

For specific tasks where it is feasible to provide labeled data (marking answers to questions as correct or identifying generated text as relevant), fine-tuning an LLM (performing additional training on the base model with labeled data) is a better approach and can lead to much greater accuracy. For such task-specific fine-tuning, Google’s open-source BERT model (or its successors) has become the dominant starting point. ChatGPT itself was fine-tuned with supervised learning and reinforcement learning to drive better text generation, especially to avoid harmful or biased responses. 

Examples of tasks that require much higher accuracy include discussing symptoms to receive medical advice and prescriptions, presenting a legal argument to retrieve the most relevant legal cases to buttress it, and describing a car’s mechanical problem to retrieve repair instructions. Simply using unsupervised LLMs will not yield the level of accuracy needed for these or similar tasks. 

Let’s describe a specific task that we are very familiar with: inferring skills for employees in a company’s workforce. Empath was started in 2020 with the goal of bringing truly accurate skills inference to a company’s employees. With a complete and accurate skills inventory in hand, employees can plan their next job and career progression, and companies can more efficiently build teams and manage their workforce. 

We infer skills by taking the full “digital footprint” of all employees (any source of language to, from, or about an employee) and determining not just that the employee has a skill but also their proficiency in that skill. Proficiency levels are critical since many workforce skills are almost universally possessed by each employee in a company’s workforce (so inferring skills as binary tags is not actually useful). Examples include specific capabilities such as project management and Microsoft Excel, and soft skills such as teamwork and collaboration. Of course, to infer properly we need meaningful and distinct descriptions of the behavior exhibited at each proficiency level of the skill. Fortunately, companies and consulting firms such as IBM and EY have been creating richly described skills taxonomies for decades. 
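As a concrete illustration, a leveled skill entry of the kind described above might be structured as follows. The skill name, level count, and behavioral descriptions here are invented for the example, not drawn from any real taxonomy:

```python
# Illustrative sketch of a leveled skill entry: each proficiency level
# carries a distinct behavioral description that can later be embedded
# and compared against language in an employee's digital footprint.
skill = {
    "name": "Project Management",
    "levels": {
        1: "Tracks own tasks and deadlines within a single project.",
        2: "Coordinates tasks across a small team and reports status.",
        3: "Plans schedules, budgets, and risks for multiple projects.",
        4: "Defines project-management standards across an organization.",
    },
}

def describe(skill, level):
    """Return the behavioral description for a given proficiency level."""
    return f"{skill['name']} (level {level}): {skill['levels'][level]}"

print(describe(skill, 3))
```

The key point is that each level has its own distinct description, giving the inference step something semantically meaningful to match against, rather than a single binary tag.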

What we’ve done is determine which skill proficiency levels are semantically similar to the language available in an employee’s digital footprint. The meaning of the skill description can be expressed as an “embedding vector” (a sequence of several hundred numbers that represents the precise meaning of a fragment of language). The text in the employee’s digital footprint (or, more specifically, a part of the footprint, as we will discuss shortly) can also be expressed as an embedding vector. The cosine distance between the two vectors reflects how similar the two fragments of text are, whether or not they match on any specific words whatsoever. Leveraging the power of BERT and successor LLMs, we were able to perform this task with greater than 95 percent accuracy at our first customer, AT&T.  
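The comparison step can be sketched in a few lines of plain Python. The vectors below are toy four-dimensional stand-ins for real embedding vectors (which, as noted above, have several hundred dimensions), and the variable names are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real model embeddings.
skill_level_vec = [0.12, 0.80, 0.31, 0.05]  # a proficiency-level description
footprint_vec   = [0.10, 0.75, 0.35, 0.02]  # a fragment of an employee's footprint

similarity = cosine_similarity(skill_level_vec, footprint_vec)
# Cosine distance is simply 1 - similarity; a similarity near 1.0 means
# the two text fragments carry nearly the same meaning.
print(round(similarity, 3))
```

Because the comparison happens in embedding space, two fragments can score as highly similar even when they share no vocabulary at all, which is exactly what keyword matching cannot do.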

But aren’t there other ways of doing skills inference, you might ask? For at least a decade, several other companies have claimed to infer skills, albeit usually focused on recruiting. The general approach has been to take resumes for job candidates that are filled with keywords and phrases and mark the candidate as having the relevant skills. These products maintain simple skill libraries, which are really just lists of tags (no levels or descriptions). If the resume has sufficient presence of one of the skill tags, the candidate is marked as having the skill. This is a reasonable approach to characterizing an unknown person, and it presumably makes selecting candidates for interviews a bit more efficient. Unfortunately, this method cannot determine levels of proficiency in skills. It is also reliant on matching keywords or synonyms of skills, versus using the underlying meaning to match employees and skills via semantic similarity, as we do with LLMs. 
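The tag-based approach can be sketched in a few lines (the tags and resume text here are hypothetical). Note that the outcome is strictly binary: a skill is either tagged or not, with no notion of proficiency:

```python
def tags_in_resume(resume_text, skill_tags):
    """Keyword-style matching: mark a skill if its tag appears in the text."""
    text = resume_text.lower()
    return {tag for tag in skill_tags if tag.lower() in text}

skill_tags = ["Python", "Project Management", "Excel"]
resume = "Led project management for a data team; built reports in Excel."

matched = tags_in_resume(resume, skill_tags)
# Binary outcome only: the candidate is tagged with "Project Management"
# and "Excel", but nothing indicates how proficient they actually are,
# and a resume that describes the same work in different words matches nothing.
print(sorted(matched))
```

Contrast this with the embedding-based comparison described above, which matches on meaning rather than on the literal presence of a tag.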

Over the last three years we have continuously tracked improvements in LLMs and, more importantly, optimized our usage of them. Among the many non-obvious improvements we have made is feeding the embedding vectors, along with various distance metrics, to large machine learning models in which each part of the employee’s digital footprint (their performance review, their interactions in a work system such as a CRM or issue tracker, and their project descriptions) is separately weighted by the model (indeed, we have a patent pending on this combined LLM/ML approach). With large amounts of labeled data from the companies we work with (whose employees mark the inferred skills and levels as correct or not), we have also extensively fine-tuned successor models to BERT for this specific task. This also allows us to determine which parts of the employee footprint are worth the continued effort of extracting information from.
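To make the idea of per-source weighting concrete, here is a deliberately simplified sketch. The source names, weights, and similarity scores are invented for illustration; in a real system like the one described above, the weights are learned from labeled data by the ML model, not hand-set:

```python
# Hypothetical weights over parts of an employee's digital footprint.
# In practice these would be learned from labeled data, not hand-tuned.
SOURCE_WEIGHTS = {
    "performance_review": 0.5,
    "crm_interactions": 0.3,
    "project_descriptions": 0.2,
}

def combined_score(similarities, weights=SOURCE_WEIGHTS):
    """Weighted average of per-source cosine similarities for one skill level."""
    total_weight = sum(weights[source] for source in similarities)
    weighted = sum(weights[source] * sim for source, sim in similarities.items())
    return weighted / total_weight

# Per-source cosine similarities between a skill-level description and
# the corresponding parts of one employee's footprint (toy values).
similarities = {
    "performance_review": 0.91,
    "crm_interactions": 0.78,
    "project_descriptions": 0.85,
}
print(round(combined_score(similarities), 3))
```

A useful side effect of learned weights, as the paragraph above notes, is that sources whose weights stay near zero reveal which parts of the footprint are not worth the cost of continued extraction.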

Due to their ability to provide meaningful responses to any question, large language models trained against a corpus with unsupervised learning can yield seemingly magical results when applied to the problem of unrestricted chat. Yet, LLMs that are fine-tuned for a specific task with labeled data of correct and incorrect results are actually much more powerful in their capacity to change how work is done. As our customers know, one of the most valuable of these specific tasks is inferring skills for company employees.