Today I visualized the accuracy of my XML-Importer and the modeling.
It comes with popups that show the imported data of papers/authors. This way I was able to identify three problems:
Lots of names contain special characters that my XML importer can’t handle yet.
I have to transform authors so that only the initials of their first names are used.
Titles in the SLR CSV are all lowercase. My importer, however, imports them with their original capitalization. My error visualization therefore ignores letter case.
Authors that aren’t red are either in the right format in the paper, or they are faulty hits. (For authors I set the required name similarity low. This is not the case for titles: there it’s pretty high, so it only tolerates about one malformed character.)
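The two normalizations described above (initials for first names, case-insensitive comparison with a similarity threshold) could be sketched roughly like this in Python. This is only an illustration, not EggShell’s actual code: the function names, the threshold values, and the assumption that the last whitespace-separated token is the surname are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: loose for authors, strict for titles.
AUTHOR_THRESHOLD = 0.6
TITLE_THRESHOLD = 0.9


def to_initials(name: str) -> str:
    """Reduce all first names to initials, e.g. 'Anna Maria Schmidt' -> 'A. M. Schmidt'.

    Assumes the last whitespace-separated token is the surname.
    """
    parts = name.split()
    if len(parts) < 2:
        return name.strip()
    *first_names, last_name = parts
    initials = " ".join(part[0].upper() + "." for part in first_names)
    return f"{initials} {last_name}"


def similar(a: str, b: str, threshold: float) -> bool:
    """Compare two strings while ignoring letter case."""
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold
```

With these values, a title that differs only in capitalization matches exactly, and a title with a single wrong character in a 17-character string still passes the 0.9 threshold.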
So what’s next?
First I will try to fix the encoding issue. After that, I will change the model creation so that it only uses the initials of first names. Additionally, I’ll finally implement a caching system. So TODOs in chronological order:
The XMLImport class is now able to import authors that are all aligned on one line as follows:
Comma-separated authors are also supported:
What’s not supported yet are blocks that are stacked below each other:
I think I will need to detect the form of a block (name above e-mail in this case) and search for the same structure below each detected author.
Maybe I will also be able to detect blocks by comparing line heights. Something like: all lines with the same line height and roughly the same horizontal position are considered to be part of the same block.
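The line-height idea could look roughly like the following Python sketch. It is a thought experiment, not part of the XMLImport class: the `Line` record, its fields, and both tolerance values are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float       # left edge of the line (hypothetical field)
    y: float       # vertical position, growing downwards (hypothetical field)
    height: float  # line height (hypothetical field)


def group_into_blocks(lines, x_tolerance=5.0, height_tolerance=0.5):
    """Group lines that share a line height and roughly the same left edge."""
    blocks = []
    # Walk the lines top to bottom so each block keeps its reading order.
    for line in sorted(lines, key=lambda l: l.y):
        for block in blocks:
            ref = block[0]
            same_height = abs(ref.height - line.height) <= height_tolerance
            same_column = abs(ref.x - line.x) <= x_tolerance
            if same_height and same_column:
                block.append(line)
                break
        else:
            # No existing block fits, so this line starts a new one.
            blocks.append([line])
    return blocks
```

For two authors printed side by side, each with an e-mail address below the name, this yields one block per author. It would not yet handle a name and an e-mail set in different font sizes.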
(As a side note: I have also improved the title detection)
I’m currently writing an optimistic importer, based on my XML-Extraction Binary, that makes some assumptions about the PDFs. I’m done with the importing of titles, and it is able to import the titles of most of the PDFs in the SLR folder correctly. I’ll leave it at that for now, until I have my tools for model debugging set up. Next up is the extraction of authors.
After I’m done with the optimistic import, I can look at some less convenient papers and see what I can do about them.
Today I was getting a little exhausted because my code didn’t feel clean at all. That’s why I took the time to create a UML diagram with a design that feels natural to me. I will now adjust my code to fit this design.
In addition, I gave my project the working title EggShell.
(Update: The examples all seem to work in the development version of Moose)
Certain examples from the learning materials on agilevisualization.com don’t work. I tried different versions of Moose and different Roassal packages. The first error I encountered was that the RTDSM class isn’t available. One recurring error is that trans isn’t understood.
Luckily I can correct most of the errors and still see the examples in action.
I annotate the learning pages with Scrible. I mark the parts that don’t work in the newest version of Moose (version 5.1). Up until now, the examples that didn’t work in version 5.1 also didn’t work in the version linked from agilevisualization.com (Moose 5), often for different reasons.
Well, this one was unexpected. The findTokens method splits my strings at the given delimiters (namely newlines). Two delimiters in a row count as one as long as the quoteDelimiters argument is not set. So I had to set an arbitrary quoteDelimiter to prevent the empty lines of my text from being ignored.
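The same pitfall can be reproduced in Python, purely as an analogy (this is not Pharo’s findTokens API, and the function names are made up): a splitter that treats runs of delimiters as one silently drops empty lines, while a splitter that treats every delimiter separately keeps them.

```python
import re


def split_collapsing(text: str) -> list[str]:
    """Treat consecutive newlines as a single delimiter (empty lines vanish)."""
    return re.split(r"\n+", text)


def split_keeping_empty(text: str) -> list[str]:
    """Treat every newline as its own delimiter (empty lines survive)."""
    return text.split("\n")


text = "first line\n\nthird line"
print(split_collapsing(text))     # the empty second line is lost
print(split_keeping_empty(text))  # the empty second line is kept as ''
```

If line positions matter later, for example when matching a name against the line directly below it, only the second variant keeps the line numbering intact.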