Resource Profile

Resource ID HtmlTextExtractor
Resource Name HTML Text Extractor
Resource Type Other
Resource Description This service separates an HTML document into its text segments and an HTML skeleton. For example, when this service receives the following HTML document,
"<html>
  <body>
    <h1>Weather</h1>
    <div>It's fine today.</div>
  </body>
</html>",
it outputs an array of code/text pairs, ($1, Weather) and ($2, It's fine today.), together with the HTML skeleton "<html><body><h1>$1</h1><div>$2</div></body></html>".
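
The separation described above can be sketched in Python with the standard-library html.parser module. This is only an illustration of the idea, not the service's actual implementation: the class name SkeletonBuilder and the function separate() are hypothetical, and the handling of whitespace-only text (dropped here) is an assumption inferred from the example skeleton.

```python
from html.parser import HTMLParser


class SkeletonBuilder(HTMLParser):
    """Hypothetical sketch: collects text nodes and builds a $N skeleton."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.codes_and_texts = []  # list of (code, text) pairs
        self.skeleton_parts = []   # pieces of the skeleton, joined at the end

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the source.
        self.skeleton_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as-is.
        self.skeleton_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.skeleton_parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # assumption: whitespace between tags is not kept
        code = f"${len(self.codes_and_texts) + 1}"
        self.codes_and_texts.append((code, text))
        self.skeleton_parts.append(code)


def separate(html_document):
    """Return (codes_and_texts, skeleton_html) for an unescaped HTML string."""
    parser = SkeletonBuilder()
    parser.feed(html_document)
    parser.close()
    return parser.codes_and_texts, "".join(parser.skeleton_parts)


doc = ("<html>\n  <body>\n    <h1>Weather</h1>\n"
       "    <div>It's fine today.</div>\n  </body>\n</html>")
codes, skeleton = separate(doc)
print(codes)
print(skeleton)
```

For the sample document, this sketch yields the pairs ($1, Weather) and ($2, It's fine today.) and the same skeleton as shown above.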

You can generate the HTML document in other languages by replacing each key ($x) in the skeleton with the corresponding translation; in the above example, replace $1 and $2 with translations of "Weather" and "It's fine today" respectively.
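
The key replacement step can be sketched as follows. The function name fill_skeleton() and the German strings are illustrative assumptions, not part of the service; the sketch also assumes the translated texts are already safe to embed in HTML.

```python
import re


def fill_skeleton(skeleton_html, translations):
    """Replace each $N key with its translation; unknown keys are left as-is."""
    def lookup(match):
        key = match.group(0)
        return translations.get(key, key)

    # \$\d+ matches a whole key, so "$1" never clobbers the start of "$10".
    return re.sub(r"\$\d+", lookup, skeleton_html)


skeleton = "<html><body><h1>$1</h1><div>$2</div></body></html>"
# Illustrative translations of "Weather" and "It's fine today".
translations = {"$1": "Wetter", "$2": "Heute ist es schön."}
translated = fill_skeleton(skeleton, translations)
print(translated)
```

Matching whole keys with a regular expression, rather than calling str.replace once per key, avoids partial replacements when a document contains ten or more text segments.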

The service interface is defined as follows.
<OPERATION>
HTMLDocumentSeparation separate(String htmlDocument)

<INPUT>
htmlDocument - an HTML document whose markup is escaped as character entities, for example &lt;html&gt; and &lt;h1&gt;.
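
As a sketch of how a client might prepare this input, Python's standard html.escape produces the escaped form used in the SOAP request example. The helper name escape_for_request() is hypothetical, and leaving quotes unescaped is an assumption based on that example.

```python
from html import escape


def escape_for_request(html_document):
    """Escape &, < and > so the markup can travel as plain element text."""
    # quote=False leaves ' and " untouched, matching the sample request.
    return escape(html_document, quote=False)


raw = "<html><body><h1>Weather</h1><div>It's fine today</div></body></html>"
escaped = escape_for_request(raw)
print(escaped)
```

Many SOAP client libraries perform this escaping automatically when a string is placed in an element, in which case escaping by hand would double-escape the document.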

<OUTPUT>
HTMLDocumentSeparation{
  CodeAndText[] codesAndTexts;
  String skeletonHTML;
}
codesAndTexts - an array of IDs paired with the texts enclosed by each pair of HTML tags.
skeletonHTML - the HTML document with each text replaced by its corresponding ID.

CodeAndText{
  String code;
  String text;
}
code - the ID that stands in for the text in the skeleton, e.g. $1.
text - the text enclosed by a pair of HTML tags.
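
For client code, the two output structures could be mirrored as simple Python dataclasses. This is a hypothetical client-side representation: the snake_case field names and the texts_by_code() helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CodeAndText:
    code: str  # ID such as "$1"
    text: str  # text that the ID replaces in the skeleton


@dataclass
class HTMLDocumentSeparation:
    codes_and_texts: List[CodeAndText]
    skeleton_html: str

    def texts_by_code(self) -> Dict[str, str]:
        """Hypothetical helper: map each ID to its source text."""
        return {item.code: item.text for item in self.codes_and_texts}


result = HTMLDocumentSeparation(
    codes_and_texts=[CodeAndText("$1", "Weather"),
                     CodeAndText("$2", "It's fine today")],
    skeleton_html="<html><body><h1>$1</h1><div>$2</div></body></html>",
)
print(result.texts_by_code())
```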

<EXAMPLE>
(SOAP request)
<soapenv:Envelope>
  <soapenv:Header/>
  <soapenv:Body>
    <separate>
      <htmlDocument>
      &lt;html&gt;&lt;body&gt;&lt;h1&gt;Weather&lt;/h1&gt;&lt;div&gt;It's fine today&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
      </htmlDocument>
    </separate>
  </soapenv:Body>
</soapenv:Envelope>

(SOAP response)
<soapenv:Envelope>
  <soapenv:Body>
    <separateResponse>
      <separateReturn>
        <codesAndTexts>
          <codesAndTexts>
            <code>$1</code>
            <text>Weather</text>
          </codesAndTexts>
          <codesAndTexts>
            <code>$2</code>
            <text>It's fine today</text>
          </codesAndTexts>
        </codesAndTexts>
        <skeletonHtml>
        <![CDATA[<html><body><h1>$1</h1><div>$2</div></body></html>]]>
        </skeletonHtml>
      </separateReturn>
    </separateResponse>
  </soapenv:Body>
</soapenv:Envelope>
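
A response of this shape could be read with Python's xml.etree.ElementTree, as in the sketch below. It rests on several assumptions: the sample adds the SOAP 1.1 namespace declaration (the listing above omits it, and the parser rejects an undeclared prefix), the payload elements carry no namespace, and the function name parse_separate_response() is hypothetical. A real response from the service may differ.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the response above, with the namespace declared (assumption).
SAMPLE_RESPONSE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <separateResponse>
      <separateReturn>
        <codesAndTexts>
          <codesAndTexts><code>$1</code><text>Weather</text></codesAndTexts>
          <codesAndTexts><code>$2</code><text>It's fine today</text></codesAndTexts>
        </codesAndTexts>
        <skeletonHtml>
        <![CDATA[<html><body><h1>$1</h1><div>$2</div></body></html>]]>
        </skeletonHtml>
      </separateReturn>
    </separateResponse>
  </soapenv:Body>
</soapenv:Envelope>"""


def parse_separate_response(xml_text):
    """Return ([(code, text), ...], skeleton_html) from a response envelope."""
    root = ET.fromstring(xml_text)
    result = root.find(".//separateReturn")
    # Each inner <codesAndTexts> is a direct child of the outer wrapper element.
    items = result.findall("./codesAndTexts/codesAndTexts")
    pairs = [(item.findtext("code"), item.findtext("text")) for item in items]
    # CDATA content arrives as ordinary text; strip the surrounding whitespace.
    skeleton = result.findtext("skeletonHtml").strip()
    return pairs, skeleton


pairs, skeleton_html = parse_separate_response(SAMPLE_RESPONSE)
print(pairs)
print(skeleton_html)
```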
Languages
-
Copyright Copyright (C) 2007-2009 NICT Language Grid Project. All Rights Reserved.
License LGPL
Provider Language Infrastructure Group, National Institute of Information and Communications Technology
Registration Date 2009/11/18
Last Update Date 2009/11/18
Status Run
Operation by NPO Language Grid Association.