I'm trying to clean up raw data that has embedded \r\n or \n inside CSV lines. The line terminator is \r\n. I also need to:
- translate UTF-8 punctuation marks to their plain ASCII equivalents.
- remove any remaining characters outside the ASCII range 00-7F.
I was able to put together the Mule 4 DataWeave code below, and it works except for the punctuation translation logic. I used reduce, but it is not translating correctly.
DataWeave code:
%dw 2.0
output application/csv header=true
var translateMap = {
    "‘": "'", "’": "'", "‚": "'", "‛": "'",
    "“": "\"", "”": "\"", "„": "\"",
    "–": "-", "—": "--", "―": "--",
    "…": "...", "•": "*",
    "′": "'", "″": "\"",
    "‹": "<", "›": ">", "«": "<<", "»": ">>",
    " ": " ",
    "‐": "-", "‑": "-", "‒": "-", "−": "-",
    "©": "(c)", "®": "(R)", "™": "(TM)"
}
fun cleanField(value: String) = (
    translateMap reduce ((acc, pair) -> acc replace pair.key with pair.value)
        replace /(\r\n|\n)/ with " "
        replace /[^\x00-\x7F]/ with ""
)
---
payload map (row) -> row mapObject (key, value) -> { (value) : cleanField(key) }
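For comparison, here is a minimal sketch of how the translation step could be expressed with reduce. It leans on two documented behaviors: reduce iterates over arrays rather than objects, and its lambda receives the current item first and the accumulator second. The names pairs, src, dst, and translate are hypothetical and exist only for this illustration, the map is a shortened copy of translateMap, and the output is switched to JSON purely to make the result easy to inspect. Treat it as a sketch under those assumptions, not a verified fix.

```dataweave
%dw 2.0
output application/json

// Shortened copy of translateMap, for illustration only.
var translateMap = {
    "‘": "'",
    "’": "'",
    "“": "\"",
    "”": "\"",
    "—": "--",
    "‐": "-"
}

// pluck turns the object into an array of {src, dst} pairs,
// because reduce iterates over arrays, not objects.
// (src and dst are hypothetical field names.)
var pairs = translateMap pluck ((replacement, mark) -> {
    src: mark as String,
    dst: replacement
})

// reduce passes (item, accumulator). Seeding the accumulator with the
// field value means each pair is applied to the output of the previous one.
fun translate(value: String) =
    pairs reduce ((pair, acc = value) -> acc replace pair.src with pair.dst)
---
{
    emdash: translate("Value2C—with—emdash"),
    quotes: translate("‘Single’Quote")
}
```

If this behaves as intended, the em dashes and curly quotes in the two sample values come back as "--" and "'". The same translate call could then sit in front of the two regex replace steps inside cleanField.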
Sample Data:
Header1|Header2|Header3|Header4|Header5|Header6\r\n
Value1A|Value1B|Value1C|Value1D|Value1E|Value1F\r\n
Value2A|Value2B|Value2C—with—emdash|Value2D|Value2E|Value2F\r\n
Value3A|Value3B|Value3C|Value3D ␍\f mid-line |Value3E|Value3F\r\n
Value4A|‘Single’Quote|“Double”Quote|Value4D|Value4E|Value4F\r\n
Value5A|Value5B|Value5C|Value5D|Value5E|Value5F‐hyphen\r\n
Explanation of the Sample Data:
Line Terminator: Each line ends with \r\n.
Pipe-Separated: Fields within each record are separated by the pipe symbol |.
Header: The first line contains the header row: Header1|Header2|Header3|Header4|Header5|Header6.
Five Records: There are five data rows following the header.
Six Columns: Each record has six values separated by pipes.
UTF-8 Punctuation Marks: Line 4 contains left and right single quotation marks (‘ ’) and left and right double quotation marks (“ ”).
\r\n in Mid-Line: Line 3 contains \r\n mid-line. The \r (carriage return) and \n (line feed) characters are embedded within the "Value3D" field.
Em Dash: Line 2 contains an em dash — within the "Value2C" field.
UTF-8 Hyphen: Line 5 contains a non-standard hyphen ‐ (U+2010, Hyphen) at the end of the "Value5F" field.