I have a program in Golang that does some string manipulations on text from a file. It reads in the file, does the manipulations, and then sets up the text it displays in the terminal with a red background for any text that was removed and a green background for any text that was added. This works fine until the input text has Unicode characters in it. For example, I have characters like …, –, and ◇ that are occasionally present in the text that gets compared and formatted for display in the terminal. I get characters like â\u0080¦, â\u0080\u0093, and â\u0097\u0087 respectively when I run the text through the following logic:
    package stringdiff

    import (
        "bytes"
        "strings"

        "github.com/andreyvit/diff"
        "github.com/fatih/color"
    )

    var (
        red   = color.New(color.BgRed, color.FgBlack).SprintFunc()
        green = color.New(color.BgGreen, color.FgBlack).SprintFunc()
    )

    // GetPrettyDiffString gets the diff string of the 2 passed in values where removals have a red background and additions have a green background
    func GetPrettyDiffString(original, new string) string {
        diffString := diff.CharacterDiff(original, new)
        var buff bytes.Buffer
        var diffsLen = len(diffString)
        var char, nextChar, nextNextChar, section string
        var inSection bool
        for i := 0; i < len(diffString); {
            char = string(diffString[i])
            if char == "(" && i+2 < diffsLen && !inSection {
                nextChar = string(diffString[i+1])
                nextNextChar = string(diffString[i+2])
                if nextChar == "+" && nextNextChar == "+" {
                    inSection = true
                    i += 3
                    continue
                } else if nextChar == "~" && nextNextChar == "~" {
                    inSection = true
                    i += 3
                    continue
                }
            } else if char == "~" && i+2 < diffsLen && string(diffString[i+1]) == "~" && string(diffString[i+2]) == ")" {
                inSection = false
                buff.WriteString(red(section))
                section = ""
                i += 3
                continue
            } else if char == "+" && i+2 < diffsLen && string(diffString[i+1]) == "+" && string(diffString[i+2]) == ")" {
                inSection = false
                buff.WriteString(green(section))
                section = ""
                i += 3
                continue
            }
            if inSection {
                section += char
            } else {
                buff.WriteString(char)
            }
            i++
        }
        return convertUnicodeStringsToVisualRepresentations(buff.String())
    }

    func convertUnicodeStringsToVisualRepresentations(val string) string {
        val = strings.ReplaceAll(val, "â\u0080¦", "…")
        val = strings.ReplaceAll(val, "â\u0080\u0093", "–")
        val = strings.ReplaceAll(val, "â\u0097\u0087", "◇")
        return val
    }
I have had to add convertUnicodeStringsToVisualRepresentations to handle the Unicode characters that I commonly encounter. I suspect the problem is in how the text is handled as it gets written to the bytes buffer: Go strings are expected to hold UTF-8, and somewhere in my loop the multi-byte characters are being split up and displayed incorrectly. I am not 100% certain on that, though.
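To make the suspected cause concrete, here is a minimal sketch that reproduces the symptom outside of my diff logic. It assumes the culprit is the per-byte conversion `string(diffString[i])`: indexing a Go string yields a single byte, and converting that byte to a string treats it as a code point of its own. `perByte` and `perRune` are hypothetical helper names used only for this illustration; they are not part of my package.

```go
package main

import "fmt"

// perByte rebuilds s one byte at a time, mirroring string(diffString[i]).
// Each byte of a multi-byte UTF-8 sequence is re-encoded as its own code point.
func perByte(s string) string {
	out := ""
	for i := 0; i < len(s); i++ {
		out += string(s[i]) // s[i] is a byte, not a full character
	}
	return out
}

// perRune rebuilds s one rune at a time, so multi-byte characters stay intact.
func perRune(s string) string {
	out := ""
	for _, r := range s {
		out += string(r)
	}
	return out
}

func main() {
	s := "…" // U+2026, encoded in UTF-8 as the bytes E2 80 A6
	fmt.Printf("%q\n", perByte(s)) // "â\u0080¦"
	fmt.Printf("%q\n", perRune(s)) // "…"
}
```

The per-byte result matches the garbled output I am seeing, which suggests that iterating with `range` (or over a `[]rune`) instead of indexing by byte may avoid the problem altogether.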
Is there a good way to fix this issue?
I have tests present for this logic here if you would like to see what some input looks like for the function.
Please let me know if there is a better way to ask this or if you need more information. Thanks for the help!
Edit: just for simplicity's sake, I thought I would add the solution that worked for me here, since this question has been marked as a duplicate.
I looked at the associated question and found that none of the suggested answers worked for me. But then I looked at the answers in the comments and found that one from @ANisus worked. He suggested a solution in a comment that referenced the following golang playground: https://play.golang.org/p/dBrx_ZmrsMN
The logic from that Go playground that helped me was:

    func repairLatin1(s string) (string, error) {
        buf := make([]byte, 0, len(s))
        for i, r := range s {
            if r > 255 {
                return "", fmt.Errorf("character %s at index %d is not part of latin1", string(r), i)
            }
            buf = append(buf, byte(r))
        }
        return string(buf), nil
    }
I can't say I like having to handle an error in the logic I am using, but it does pass my unit tests. I still need to confirm that it works on real input and not just in the tests.
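For anyone who wants to try it, here is a small self-contained sketch of how that helper behaves. The `main` function and the sample inputs are my own additions for illustration; only `repairLatin1` comes from the playground. It shows that the garbled form of … is turned back into the original character, and that a string containing a real character above U+00FF triggers the error path.

```go
package main

import "fmt"

// repairLatin1 maps every rune back to a single byte, undoing the
// byte-to-code-point expansion. It fails on any rune above 255.
func repairLatin1(s string) (string, error) {
	buf := make([]byte, 0, len(s))
	for i, r := range s {
		if r > 255 {
			return "", fmt.Errorf("character %s at index %d is not part of latin1", string(r), i)
		}
		buf = append(buf, byte(r))
	}
	return string(buf), nil
}

func main() {
	// The garbled form of "…": three code points, one per original UTF-8 byte.
	fixed, err := repairLatin1("\u00e2\u0080\u00a6")
	fmt.Println(fixed, err) // … <nil>

	// An intact "◇" (U+25C7) is above 255, so the helper reports an error.
	_, err = repairLatin1("◇")
	fmt.Println(err != nil) // true
}
```

The error case matters for my use: if the string passed in ever mixes garbled and correctly encoded characters, the helper refuses the whole string rather than repairing part of it.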