Unable to retrieve table elements using jsoup

I’m new to jsoup and am struggling to retrieve the tables with the class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas

I started with the following, but it returns no results right from the start:

Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty

I also tried this, but again no results:

Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();

Elements divs = document.select("div");

if (!divs.isEmpty()) {
    for (Element div : divs) {
        // all of these are empty
        Elements verbTenses = div.getElementsByClass("verbtense");
        Elements verbTables = div.getElementsByClass("verbtable");
        Elements tables = div.getElementsByClass("table verbtable");
    }
}

What am I doing incorrectly?

Answer

The first catch is that this page loads its content asynchronously via AJAX and uses JavaScript to add it to the DOM. You can even see the loader for a short time.

Jsoup can’t parse or execute JavaScript, so all you get is the initial page :( The next step is to check what the browser is doing and where this additional content comes from. You can check this with Chrome’s developer tools (Ctrl + Shift + I). If you open the Network tab, filter for XHR requests only, and refresh the page, you will see two requests.

One of them fetches this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas

As you can see, it’s JSON containing HTML fragments, and this content seems to include the verb forms you need. But here’s another catch: unfortunately, Jsoup can’t parse JSON :( So you’ll have to use another library to extract the HTML fragment, which you can then parse with Jsoup. The general advice for downloading JSON with Jsoup is to ignore the content type (otherwise Jsoup will complain that it doesn’t support JSON):

String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();

Then you’ll have to use a JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse into a document with Jsoup:

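import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch the raw JSON; ignoreContentType(true) stops Jsoup from rejecting the non-HTML response
String json = Jsoup.connect(
    "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
    .ignoreContentType(true).execute().body();
System.out.println(json);
// The HTML fragment is read from the "html" field of the "p1" object
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
// Parse the fragment so that selectors can be run against it
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);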

Now you can try your initial approach, using selectors to get what you want from the document object.
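
As a rough sketch of that last step, the snippet below runs a class selector against a small hand-written fragment. The fragment, the class name VerbTableSketch and the helpers headersOf and exactClassMatches are made up for illustration; the markup the API actually returns may be structured differently, so the selectors would likely need adjusting. It also shows one difference worth knowing: table.verbtense matches any table whose class list contains verbtense, while table[class=verbtense] only matches when the class attribute is exactly that string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class VerbTableSketch {

    // Hypothetical stand-in for the HTML fragment taken from the JSON;
    // the real fragment may look different.
    static final String SAMPLE_FRAGMENT =
        "<div><h3>Indicative</h3>"
        + "<table class=\"verbtense extra\"><tr><th>Present</th></tr>"
        + "<tr><td>misslyckas</td></tr></table>"
        + "<table class=\"verbtense\"><tr><th>Past</th></tr>"
        + "<tr><td>misslyckades</td></tr></table>"
        + "</div>";

    // Collects the text of the first header cell of every table
    // that has "verbtense" among its classes.
    static List<String> headersOf(String html) {
        Document doc = Jsoup.parse(html);
        List<String> headers = new ArrayList<>();
        for (Element table : doc.select("table.verbtense")) {
            Element th = table.select("th").first(); // null if the table has no <th>
            if (th != null) {
                headers.add(th.text());
            }
        }
        return headers;
    }

    // Counts tables whose class attribute is exactly "verbtense".
    static int exactClassMatches(String html) {
        return Jsoup.parse(html).select("table[class=verbtense]").size();
    }

    public static void main(String[] args) {
        System.out.println(headersOf(SAMPLE_FRAGMENT));
        System.out.println(exactClassMatches(SAMPLE_FRAGMENT));
    }
}
```

Also note that getElementsByClass expects a single class name, so a call like getElementsByClass("table verbtable") will not match an element that carries both classes; a selector such as select(".table.verbtable") is the usual way to express that.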
