Channel: Active questions tagged utf-8 - Stack Overflow

Translate UTF-8 punctuation with normal ascii punctuation marks


I'm trying to clean up raw data that has embedded \r\n or \n inside CSV lines. The line terminator is \r\n. I'm also:

  • trying to translate UTF-8 punctuation marks to normal ASCII punctuation marks.
  • removing any remaining characters outside the ASCII range 00-7F.
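To make the intended order of operations concrete, here is a minimal sketch of the three cleanup steps in Python (not DataWeave). It assumes only a subset of the translation map, with the keys written as escapes so the code points are unambiguous. The name `clean_field` and the no-break space entry are my assumptions for illustration, not part of the Mule flow. The fold is seeded with the field value itself, so each map entry is applied to the running result.

```python
import re
from functools import reduce

# Subset of the translation map, keyed by code point escapes.
# Assumption: the blank key in the original map is U+00A0 (no-break space).
TRANSLATE_MAP = {
    "\u2018": "'", "\u2019": "'",        # left/right single quotation marks
    "\u201c": '"', "\u201d": '"',        # left/right double quotation marks
    "\u2013": "-", "\u2014": "--",       # en dash, em dash
    "\u2010": "-",                       # hyphen (U+2010)
    "\u2026": "...",                     # horizontal ellipsis
    "\u00a0": " ",                       # no-break space (assumed)
    "\u00a9": "(c)", "\u00ae": "(R)", "\u2122": "(TM)",
}


def clean_field(value: str) -> str:
    """Translate known punctuation, flatten line breaks, drop remaining non-ASCII."""
    # Step 1: fold over the (key, replacement) pairs, starting from the field value.
    translated = reduce(
        lambda acc, pair: acc.replace(pair[0], pair[1]),
        TRANSLATE_MAP.items(),
        value,
    )
    # Step 2: replace embedded \r\n or \n with a single space.
    flattened = re.sub(r"\r\n|\n", " ", translated)
    # Step 3: remove anything still outside the ASCII range 00-7F.
    return re.sub(r"[^\x00-\x7F]", "", flattened)


print(clean_field("Value2C\u2014with\u2014emdash"))  # Value2C--with--emdash
print(clean_field("\u2018Single\u2019Quote"))        # 'Single'Quote
```

The translation has to run before the non-ASCII strip; otherwise the punctuation is deleted instead of being replaced.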

I was able to put together the Mule 4 DataWeave code below, except for the punctuation translation logic. I used reduce, but it's not translating correctly.

DataWeave code:

    %dw 2.0
    output application/csv header=true
    var translateMap = {
        "‘": "'", "’": "'", "‚": "'", "‛": "'",
        "“": "\"", "”": "\"", "„": "\"",
        "–": "-", "—": "--", "―": "--",
        "…": "...", "•": "*",
        "′": "'", "″": "\"",
        "‹": "<", "›": ">", "«": "<<", "»": ">>",
        " ": " ", "‐": "-", "‑": "-", "‒": "-", "−": "-",
        "©": "(c)", "®": "(R)", "™": "(TM)"
    }
    fun cleanField(value: String) = (
        translateMap reduce ((acc, pair) -> acc replace pair.key with pair.value)
          replace /(\r\n|\n)/ with " "
          replace /[^\x00-\x7F]/ with ""
    )
    ---
    payload map (row) ->
      row mapObject (key, value) -> {
         (value) : cleanField(key)
      }

Sample Data:

    Header1|Header2|Header3|Header4|Header5|Header6\r\n
    Value1A|Value1B|Value1C|Value1D|Value1E|Value1F\r\n
    Value2A|Value2B|Value2C—with—emdash|Value2D|Value2E|Value2F\r\n
    Value3A|Value3B|Value3C|Value3D ␍\f mid-line |Value3E|Value3F\r\n
    Value4A|‘Single’Quote|“Double”Quote|Value4D|Value4E|Value4F\r\n
    Value5A|Value5B|Value5C|Value5D|Value5E|Value5F‐hyphen\r\n

Explanation of the Sample Data:

  • Line Terminator: Each line ends with \r\n as requested.

  • Pipe-Separated: Fields within each record are separated by the pipe symbol |.

  • Header: The first line contains the header row:
    Header1|Header2|Header3|Header4|Header5|Header6.

  • Five Records: There are five data rows following the header.

  • Six Columns: Each record has six values separated by pipes.

  • UTF-8 Punctuation Marks:

    Quotation marks: Line 4 contains left and right single quotation marks (‘ ’) and left and right double quotation marks (“ ”).

    \r\n in Mid-Line: Line 3 contains \r\n mid-line. The \r (carriage return) and \n (line feed) characters are embedded within the "Value3D" field.

    Em dash: Line 2 contains an em dash — within the "Value2C" field.

    UTF-8 Hyphen: Line 5 contains a non-standard hyphen ‐ (U+2010, Hyphen) at the end of the "Value5F" field.
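To double-check which characters are actually in the sample, a small Python sketch like the one below lists every distinct non-ASCII character with its code point and Unicode name. The helper name `non_ascii_report` and the shortened sample string are assumptions for illustration. The string only contains the fields from the sample that carry special punctuation.

```python
import unicodedata


def non_ascii_report(text: str) -> list:
    """Return (code point, Unicode name) pairs for each distinct non-ASCII
    character, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


# Fields from the sample that contain non-ASCII punctuation, written as escapes.
sample = (
    "Value2C\u2014with\u2014emdash|"
    "\u2018Single\u2019Quote|\u201cDouble\u201dQuote|"
    "Value5F\u2010hyphen"
)

for code_point, name in non_ascii_report(sample):
    print(code_point, name)
# U+2014 EM DASH
# U+2018 LEFT SINGLE QUOTATION MARK
# U+2019 RIGHT SINGLE QUOTATION MARK
# U+201C LEFT DOUBLE QUOTATION MARK
# U+201D RIGHT DOUBLE QUOTATION MARK
# U+2010 HYPHEN
```

Running a report like this against the real input before writing the translation map shows which keys the map needs to cover.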

