The Data Lab


The origin of Star Wars characters’ names determined using Data Science

News 12/06/2015

In preparation for the premiere of the new Star Wars movie, Richard and Roman, the Data Scientists at The Data Lab, have analysed several hundred characters from the Star Wars films and associated series to determine which language each name is most likely to have come from.

A list of over 500 names was taken from Wikipedia, and an n-gram model was applied to each one. The n-gram model, a technique from natural language processing (a branch of artificial intelligence), first splits the name into a sequence of single-, double-, and triple-character strings. For example, the name “Luke” decomposes into the strings “l”, “u”, “k”, “e”, “lu”, “uk”, “ke”, “luk”, and “uke”. Using a piece of software called textcat (short for “text categorisation”), the frequencies of the resulting strings are compared with those of dozens of language corpora. From this, the software calculates the probability of a given name coming from each of the languages. The most likely language is noted for each character name.
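The decomposition step described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names, the toy sample sentences, and the simple overlap score are all assumptions made here, and the score is a deliberately simplified stand-in for whatever comparison textcat performs internally.

```python
from collections import Counter


def char_ngrams(name, max_n=3):
    """Return all character n-grams of length 1..max_n, shortest first."""
    text = name.lower()
    return [text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]


def ngram_profile(text, max_n=3):
    """Map each n-gram in the text to its relative frequency."""
    counts = Counter(char_ngrams(text, max_n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def overlap_score(name, language_profile, max_n=3):
    """Toy similarity: sum the profile frequencies of the name's n-grams."""
    return sum(language_profile.get(gram, 0.0)
               for gram in char_ngrams(name, max_n))


def most_likely_language(name, profiles, max_n=3):
    """Pick the language whose profile gives the highest toy score."""
    return max(profiles,
               key=lambda lang: overlap_score(name, profiles[lang], max_n))


# Hypothetical one-sentence "corpora", far too small for real use.
TOY_CORPORA = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
}
profiles = {lang: ngram_profile(text) for lang, text in TOY_CORPORA.items()}

print(char_ngrams("Luke"))
# ['l', 'u', 'k', 'e', 'lu', 'uk', 'ke', 'luk', 'uke']
print(most_likely_language("Luke", profiles))
```

With profiles built from one sentence per language, the chosen language is essentially arbitrary; a real analysis would need profiles derived from large corpora.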

The authors are keen to point out that this exercise was done for fun and that the results are not meant to be taken too seriously. A single name yields only a handful of n-grams, whereas the technique is really only applicable to larger bodies of text and is typically used to categorise written works by, for example, similarity, author or subject matter. The research did throw up some interesting conclusions, however.

The names span a huge number of different languages, from the readily familiar to the rather more obscure. Middle Frisian, for example, was spoken around the Netherlands, Germany and southern Denmark in the 17th and 18th centuries, whilst Tagalog is a modern-day language from the Philippines.

There appears to be a connection between the names of the Hutt characters and Scottish. In addition to Jabba the Hutt, each of Borvo the Hutt, Gardulla the Hutt, Mama the Hutt, Rotta the Hutt, Ziro the Hutt, and Zorba the Hutt maps to Scottish, as does Sy Snootles, the lead vocalist in Jabba’s house band in Episode VI – Return of the Jedi.

The full list of names can be found here.


The Data Lab is part of the University of Edinburgh, a charitable body registered in Scotland with registration number SC005336.


© 2025 The Data Lab