XML-Import over SLR Folder

Today I visualized the accuracy of my XML importer and the modeling.

Visualization of the import accuracy. Red squares are papers with wrong names; red circles are collaborators with wrong names.

It comes with popups that show the imported data of papers and authors. This way I was able to identify three problems:

  • Many names contain special characters that my XML importer can’t handle yet
  • I have to transform authors to use only the initials of their first names
  • Titles in the SLR CSV are all lowercase, while my importer imports the original uppercase versions. My error visualization therefore ignores letter case.
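The case-insensitive comparison from the last point boils down to a small normalization step before matching. A quick sketch in Python rather than Pharo, with made-up example strings:

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so that case and spacing
    differences do not count as import errors."""
    return " ".join(title.lower().split())

# Hypothetical values, not taken from the actual SLR data:
imported = "Heapviz:  Interactive Heap Visualization"
csv_title = "heapviz: interactive heap visualization"
normalize_title(imported) == normalize_title(csv_title)  # same after normalization
```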

Authors that aren’t red are either in the right format in the paper, or they are false positives (for authors I set the required name similarity low; for titles it’s quite high, so only about one malformed character is tolerated).
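The two-threshold idea above can be sketched like this (in Python rather than Pharo; the concrete threshold values are illustrative guesses, not the ones I actually use):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative values: low for authors (tolerates initials vs. full
# names), high for titles (roughly one malformed character).
AUTHOR_THRESHOLD = 0.6
TITLE_THRESHOLD = 0.95

def matches(a, b, threshold):
    return similarity(a, b) >= threshold
```

A low author threshold lets “J. Smith” match “John Smith”, at the cost of occasional false positives, which is exactly what the red markers surface.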

So what’s next?

First I will try to fix the encoding issue. After that, I will change the model creation so it only uses the initials of first names. Additionally, I’ll finally implement a caching system. So the TODOs in chronological order are:

  1. Fix encoding
  2. Fix name-format
  3. Caching
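For TODO 2, the intended name transformation could look like this sketch (Python for illustration; the real model creation happens in Pharo, and the surname assumption is a simplification):

```python
def to_initials(author):
    """Turn 'John Ronald Smith' into 'J. R. Smith'.
    Assumes the last whitespace-separated token is the surname,
    which real author names will sometimes violate."""
    parts = author.split()
    if len(parts) < 2:
        return author  # nothing to abbreviate
    initials = [p[0] + "." for p in parts[:-1]]
    return " ".join(initials + [parts[-1]])
```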

XMLImport author detection

The XMLImport class is now able to import authors that are all aligned on one line as follows:


Comma-separated authors are also supported:


What’s not supported yet are blocks that are stacked below each other:


I think I will need to detect the form of a block (name above e-mail in this case) and search for the same structure below each detected author.

Maybe I will also be able to detect blocks by comparing line heights. Something like: all lines with the same line height and roughly the same horizontal position are considered part of the same block.
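That grouping heuristic can be sketched as a greedy pass over the lines (Python for illustration; the field names and the tolerance value are assumptions, and the lines are assumed to be sorted top to bottom):

```python
def group_into_blocks(lines, x_tolerance=5):
    """Greedy grouping: a line joins the previous block when it has the
    same line height and roughly the same left edge; otherwise it
    starts a new block. Each line is a dict with 'left' and 'height'."""
    blocks = []
    for line in lines:
        if blocks:
            prev = blocks[-1][-1]
            if (line["height"] == prev["height"]
                    and abs(line["left"] - prev["left"]) <= x_tolerance):
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks
```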

(As a side note: I have also improved the title detection)


Optimistic XML Import

I’m currently writing an optimistic importer, based on my XML extraction binary, that makes some assumptions about the PDFs. I’m done with the import of titles, and it is able to import the titles of most PDFs in the SLR folder correctly. I’ll leave it at that for now, until I have my tools for model debugging set up. Next up is the extraction of authors.

After I’m done with the optimistic import, I can look at some less convenient papers and see what I can do about them.



I’m finished with the first version of my PDF-to-XML tool. I’ve decided on a structure that looks as follows:

  <block left="84" top="72">
    <span id="f1" font-size="17" vertical-align="baseline" color="#000000" font-family="sans-serif" font-weight="bold" font-style="normal">
      Heapviz: Interactive Heap Visualization for Program
    </span>
  </block>
  <block left="174" top="92">
    <span id="f1" font-size="17" vertical-align="baseline" color="#000000" font-family="sans-serif" font-weight="bold" font-style="normal">
      Understanding and Debugging
    </span>
  </block>

Now I have to figure out how to extract the needed data from this XML. For example, how do I detect that a sentence is split across multiple blocks, as in the example above?
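One possible heuristic, sketched in Python rather than Pharo: two consecutive blocks whose spans share the same font attributes and whose vertical distance is about one line height probably belong to the same sentence. The miniature XML and the gap threshold below are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# A well-formed miniature of the structure above (values illustrative).
XML = """
<page>
  <block left="84" top="72">
    <span font-size="17" font-weight="bold">Heapviz: Interactive Heap Visualization for Program</span>
  </block>
  <block left="174" top="92">
    <span font-size="17" font-weight="bold">Understanding and Debugging</span>
  </block>
</page>
"""

def merge_title_blocks(root, max_gap=25):
    """Join consecutive blocks whose spans share font-size and weight
    and whose vertical distance is small enough to be the next line."""
    pieces, prev = [], None
    for block in root.iter("block"):
        span = block.find("span")
        key = (span.get("font-size"), span.get("font-weight"))
        top = int(block.get("top"))
        if prev and key == prev[0] and 0 < top - prev[1] <= max_gap:
            pieces[-1] += " " + span.text.strip()  # continuation line
        else:
            pieces.append(span.text.strip())       # new piece of text
        prev = (key, top)
    return pieces

merge_title_blocks(ET.fromstring(XML))
```

With the two example blocks (tops 72 and 92, same bold 17pt font), the heuristic merges them into a single title string.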


Problems with examples for Roassal

(Update: The examples all seem to work in the development version of Moose)

Certain examples from the learning materials on agilevisualization.com don’t work. I tried different versions of Moose and different Roassal packages. The first error I encountered was that the RTDSM class isn’t available. One recurring error is that trans isn’t understood.

Luckily I can correct most of the errors and still see the examples in action.

I annotate the learning sites with Scrible and mark the parts that don’t work in the newest version of Moose (version 5.1). Up until now, the examples that didn’t work in version 5.1 also didn’t work in the version linked from agilevisualization.com (Moose 5), often for different reasons.

You can check out my current annotations here:


Loading Data from PDFs into Pharo

Well, this one took way too long to accomplish.

In order to get the information from PDFs into Pharo, I had to find a simple cross-platform binary to turn them into text. There are fancy tools for extracting data from scientific papers (authors, citations, etc.), but those depend on machine learning and are either huge (500 MB and above) or inaccurate.

I still have to decide where I will keep this binary. I execute it using PipeableOSProcess (CommandShell somehow doesn’t recognize the binary). Neither OSProcess nor CommandShell seems to be able to access all binaries available in the terminal. (I installed pdftotext using Homebrew; neither of the classes was able to use it.)

I now have the PDF available split up into lines. I will try to figure out which lines are names as follows: check whether the line below the name candidate contains an @ character, or something like “University” or the name of a state, town, or country. I will have to test multiple PDFs to make sure this heuristic is accurate.
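The look-at-the-line-below idea can be sketched like this (Python for illustration; the keyword list is an illustrative guess, not the exhaustive list of institution and place names I would actually need):

```python
AFFILIATION_HINTS = ("university", "institute", "department", "college")

def looks_like_affiliation(line):
    """Heuristic: affiliation lines contain an e-mail address or an
    institution keyword. Place names would extend AFFILIATION_HINTS."""
    lower = line.lower()
    return "@" in line or any(hint in lower for hint in AFFILIATION_HINTS)

def candidate_is_author(lines, i):
    """A name candidate counts as an author when the next line
    looks like an affiliation."""
    return i + 1 < len(lines) and looks_like_affiliation(lines[i + 1])
```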

One more thing: pdftotext apparently has some quirks. One PDF was extracted into text t h a t  l o o k s  l i k e  t h i s. Given that PDF is heavily focused on layout rather than information, I think I will run into more problems like this.
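That particular quirk looks repairable in post-processing. A sketch (assuming, as in the example, single spaces between letters and double spaces between words; real pdftotext output may not be this regular):

```python
def fix_letter_spacing(line):
    """Collapse 't h a t  l o o k s  l i k e  t h i s' back into words.
    Only applied when every token between double spaces splits into
    single characters, so normal text passes through unchanged."""
    words = line.split("  ")
    if all(len(ch) == 1 for word in words for ch in word.split()):
        return " ".join("".join(word.split()) for word in words)
    return line
```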
