I'm writing a crawler in Java to crawl some websites whose pages may contain non-ASCII Unicode characters such as "£". When I store the content (the source HTML) in a Java String, these characters are lost and replaced by a question mark "?". I'd like to know how to keep them intact. The relevant code is as follows:
    protected String readWebPage(String weburl) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet(weburl);
        ResponseHandler<String> responseHandler = new BasicResponseHandler();
        String responseBody = httpclient.execute(httpget, responseHandler);
        // responseBody now contains the contents of the page
        httpclient.getConnectionManager().shutdown();
        return responseBody;
    }

    // function call
    String res = readWebPage(url);
    PrintWriter out = new PrintWriter(outDir + name + ".html");
    out.println(res);
    out.close();
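To narrow it down, the "?" symptom can be reproduced without any network access. This is only a minimal sketch, assuming the character is lost when the text is encoded with a charset that cannot represent "£" (for example a platform default of US-ASCII). The class name `CharsetDemo` and both helper methods are made up for illustration and are not part of my crawler:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetDemo {
    // Hypothetical helper: encodes the text as US-ASCII and decodes it again.
    // Characters that US-ASCII cannot represent are replaced with '?'.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Hypothetical helper: writes the text to a temporary file as UTF-8
    // and reads it back as UTF-8, so nothing depends on the platform default.
    static String roundTripUtf8File(String text) throws IOException {
        Path file = Files.createTempFile("page", ".html");
        try {
            try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
                out.print(text);
            }
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        // \u00A3 is the pound sign, written as an escape so the source file encoding does not matter.
        String text = "Price: \u00A3" + "10";
        System.out.println(roundTripAscii(text).equals(text));    // false: the pound sign became '?'
        System.out.println(roundTripUtf8File(text).equals(text)); // true: UTF-8 keeps it intact
    }
}
```

If this is what is happening, the equivalent change in my code would be to pass a charset name such as "UTF-8" as the second argument to the `PrintWriter` constructor. Whether the response body is decoded with the right charset in the first place is a separate step where the character could also be lost.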
Later, when matching characters, I also want to be able to do something like:
    if (text.indexOf("£") >= 0)
I don't know whether Java will recognize that character and match it the way I expect.
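For reference, here is a small sketch of the kind of match I have in mind. It assumes the text was decoded correctly, so the String really contains U+00A3. The literal is written as the escape `\u00A3` so that the result does not depend on the encoding the compiler assumes for the source file. `PoundMatch` and `containsPound` are hypothetical names used only for this example:

```java
class PoundMatch {
    // The pound sign as a Unicode escape (U+00A3), independent of source file encoding.
    static final String POUND = "\u00A3";

    // Returns true if the text contains at least one pound sign.
    static boolean containsPound(String text) {
        return text.indexOf(POUND) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPound("Total: \u00A3" + "20")); // true
        System.out.println(containsPound("Total: ?20"));           // false: the character was already lost
    }
}
```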
Any input will be greatly appreciated. Thanks in advance.