Resource ID | HtmlTextExtractor |
---|---|
Resource Name | HTML Text Extractor |
Resource Type | Other |
Resource Description | This service separates an HTML document into its texts and an HTML skeleton. For example, given the HTML document "<html> <body> <h1>Weather</h1> <div>It's fine today.</div> </body> </html>", it outputs an array of code-text pairs, ($1, Weather) and ($2, It's fine today.), together with the HTML skeleton "<html><body><h1>$1</h1><div>$2</div></body></html>". To generate the HTML document in another language, replace each key ($x) in the skeleton with the translation of the corresponding text; in the example above, replace $1 and $2 with the translations of "Weather" and "It's fine today." respectively. The service interface is defined as follows. <OPERATION> HTMLDocumentSeparation separate(String htmlDocument) <INPUT> htmlDocument - a document in escaped HTML, in which tags such as <html> and <h1> are written as &lt;html&gt; and &lt;h1&gt;. <OUTPUT> HTMLDocumentSeparation{ CodeAndText[] codesAndTexts; String skeletonHTML; } codesAndTexts - an array of IDs and the texts enclosed by a pair of HTML tags. skeletonHTML - the HTML document with each text replaced by the corresponding ID. CodeAndText{ String code; String text; } code - the ID. text - the text enclosed by a pair of HTML tags. <EXAMPLE> (SOAP request) <soapenv:Envelope> <soapenv:Header/> <soapenv:Body> <separate> <htmlDocument> &lt;html&gt;&lt;body&gt;&lt;h1&gt;Weather&lt;/h1&gt;&lt;div&gt;It's fine today&lt;/div&gt;&lt;/body&gt;&lt;/html&gt; </htmlDocument> </separate> </soapenv:Body> </soapenv:Envelope> (SOAP response) <soapenv:Envelope> <soapenv:Body> <separateResponse> <separateReturn> <codesAndTexts> <codesAndTexts> <code>$1</code> <text>Weather</text> </codesAndTexts> <codesAndTexts> <code>$2</code> <text>It's fine today</text> </codesAndTexts> </codesAndTexts> <skeletonHtml> <![CDATA[<html><body><h1>$1</h1><div>$2</div></body></html>]]> </skeletonHtml> </separateReturn> </separateResponse> </soapenv:Body> </soapenv:Envelope> |
Languages | |
Copyright | Copyright (C) 2007-2009 NICT Language Grid Project. All Rights Reserved. |
License | LGPL |
Provider | Language Infrastructure Group, National Institute of Information and Communications Technology |
Registration Date | 2009/11/18 |
Last Update Date | 2009/11/18 |
Status | Run |
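
The separate-then-refill workflow in the Resource Description can be sketched locally as below. This is only an illustration under stated assumptions, not the service's actual implementation: the names `separate_locally` and `fill_skeleton` are hypothetical, the input is taken as plain (unescaped) HTML, whitespace-only text between tags is dropped as in the example skeleton, and codes are numbered in document order.

```python
import html
import re
from html.parser import HTMLParser


class _Separator(HTMLParser):
    """Collects (code, text) pairs and builds a skeleton with $n placeholders."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.codes_and_texts = []   # list of (code, text) tuples
        self.skeleton_parts = []    # skeleton fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as written, attributes included.
        self.skeleton_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.skeleton_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.skeleton_parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # assumption: whitespace-only runs are not translatable
        code = f"${len(self.codes_and_texts) + 1}"
        self.codes_and_texts.append((code, text))
        self.skeleton_parts.append(code)


def separate_locally(html_document):
    """Hypothetical local stand-in for the service's separate() operation."""
    parser = _Separator()
    parser.feed(html_document)
    parser.close()
    return parser.codes_and_texts, "".join(parser.skeleton_parts)


def fill_skeleton(skeleton, translations):
    """Replace each $n key in the skeleton with its translated text.

    A single regex pass matches whole keys, so "$1" never clobbers "$10".
    Keys without a translation are left in place.
    """
    def substitute(match):
        key = match.group(0)
        if key not in translations:
            return key
        # Re-escape &, <, > because the extracted texts were unescaped.
        return html.escape(translations[key], quote=False)

    return re.sub(r"\$\d+", substitute, skeleton)
```

Calling `separate_locally` on the example document yields the pairs `("$1", "Weather")` and `("$2", "It's fine today.")` plus the skeleton; passing that skeleton and a dictionary of translated texts keyed by code to `fill_skeleton` produces the translated page. One limitation of this sketch: source text that itself contains a literal `$` followed by digits would be mistaken for a key.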