Channel: Active questions tagged utf-8 - Stack Overflow

Getting the last character from a string that may or may not be Unicode

I'm parsing a file that contains both plain ASCII strings and Unicode (UTF-8) strings containing IPA pronunciations.

I want to obtain the last character of a string, but sometimes that character is made up of two code points (a base letter plus a diacritical mark), e.g.

    syl = 'tyl'          # plain ascii
    last_char = syl[-1]  # last char is 'l'

    syl = 'tl̩'           # contains IPA char
    last_char = syl[-1]
    # last char erroneously contains: '̩' which is a diacritical mark on the l
    # want the whole character 'l̩'

If I try using .decode(), it fails with:

    AttributeError: 'str' object has no attribute 'decode'

If I try .encode().decode() instead, I'm right back where I started: I get just the diacritical mark instead of the full character.
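To make the behaviour described above concrete, here is a minimal sketch, assuming Python 3, where `str` is already decoded text. It suggests that the encode/decode round trip cannot help because the diacritic is a separate code point, not an encoding artifact. The variable names are illustrative only.

```python
import unicodedata

# 't', 'l', then U+0329 written as an escape so the combining mark is visible.
syl = "tl\u0329"

# In Python 3 a str is a sequence of code points, so a UTF-8 round trip
# returns an equal string and changes nothing about indexing.
round_tripped = syl.encode("utf-8").decode("utf-8")

# Show each code point with its Unicode name and canonical combining class.
# A non-zero combining class marks a character that attaches to the previous one.
for ch in syl:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
```

The last line printed is the combining mark on its own, which is exactly what `syl[-1]` returns.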

How can I obtain the last character of a Unicode/UTF-8 string when I don't know whether it is an ASCII or a Unicode string?

I guess I could use a lookup table of known characters and, if the lookup fails, go back and grab syl[-2:]. Is there an easier way?
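One possible alternative to a lookup table, sketched here under the assumption that every trailing diacritic is a true Unicode combining mark, is to walk backwards over combining marks with the standard library's `unicodedata.combining`. The function name `last_char` is hypothetical, and this is not a full grapheme-cluster segmenter.

```python
import unicodedata


def last_char(s: str) -> str:
    """Return the last base character plus any combining marks after it."""
    if not s:
        return ""
    i = len(s) - 1
    # Step left while the current code point has a non-zero canonical
    # combining class, i.e. it attaches to the character before it.
    while i > 0 and unicodedata.combining(s[i]):
        i -= 1
    return s[i:]


plain = last_char("tyl")        # ASCII only: just the final letter
marked = last_char("tl\u0329")  # 'l' followed by U+0329
```

Note that symbols whose second element is not a combining mark, such as a letter followed by an ASCII colon, would still come back as the second element alone with this approach.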

.....

In response to some comments, here is the complete list of IPA characters I've collected so far

a, b, d, e, f, f̩, g, h, i, i̩, i̬, j, k, l, l̩, m, n, n̩, o, p, r, s, s̩, t, t̩, t̬, u, v, w, x, z, æ, ð, ŋ, ɑ, ɑ̃, ɒ, ɔ, ə, ə:, ɚ, ɚ:, ɛ, ɜ, ɜ˞, ɝ, ɡ, ɪ, ɵ, ɹ, ɹ:, ɾ, ʃ, ʃ̩, ʊ, ʌ, ʒ, ʤ, θ, ∅,
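With a known symbol list like the one above, the lookup idea can be sketched as a longest-suffix match. The snippet below uses only a small subset of the list, written with escapes so the code points are unambiguous; `KNOWN_SYMBOLS` and `last_symbol` are hypothetical names, and the fallback to a single code point is an assumption about the desired behaviour.

```python
# Subset of the symbol list, for illustration only.
KNOWN_SYMBOLS = [
    "l", "t", "\u0259",   # single code point symbols
    "l\u0329",            # l + combining vertical line below
    "t\u032c",            # t + combining caron below
    "\u0259:",            # schwa + ASCII colon (length mark as written above)
    "\u025c\u02de",       # U+025C + U+02DE (modifier letter rhotic hook)
]


def last_symbol(s, symbols=KNOWN_SYMBOLS):
    """Return the longest known symbol that s ends with."""
    # Try longer symbols first so a two code point symbol wins over
    # the single code point it ends with.
    for sym in sorted(symbols, key=len, reverse=True):
        if s.endswith(sym):
            return sym
    # Unknown ending: fall back to the last code point (empty for "").
    return s[-1:]
```

Unlike a check based on combining classes, this also covers symbols whose second element is an ordinary character, at the cost of maintaining the list.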
