Today I visualized the accuracy of my XML-Importer and the modeling.
It comes with popups that show the imported data of papers/authors. This way I was able to figure out three problems:
Lots of names contain special characters that my XML importer can’t handle yet
I have to transform authors to use only the initials of their first names
Titles in the SLR CSV are all lowercase. My importer, however, imports them with their original capitalization. My error visualization therefore ignores letter case.
Authors that aren’t red are either in the right format in the paper, or they are faulty hits. (For authors I set the required name similarity low. For titles the required similarity is pretty high, so it only tolerates something like one malformed character.)
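To make the two thresholds concrete, here is a minimal sketch of a case-insensitive similarity check. It is written in Python for illustration only and is not the code of my importer; the threshold values and the function names are assumptions I made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: loose for author names, strict for titles.
AUTHOR_THRESHOLD = 0.6
TITLE_THRESHOLD = 0.95


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(imported: str, expected: str, threshold: float) -> bool:
    """Treat two strings as the same entry if they are similar enough."""
    return similarity(imported, expected) >= threshold
```

With a strict title threshold, a title with a single wrong character still matches, while a loose author threshold also accepts names that only partly agree, which explains the faulty hits.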
So what’s next?
First I will try to fix the encoding issue. After that, I will change the model creation so it only uses the initials of first names. Additionally, I’ll finally implement a caching system. So TODOs in chronological order:
The XMLImport class is now able to import authors that are all aligned on one line as follows:
Comma-separated authors are also supported:
What’s not supported yet are blocks that are below each other:
I think I will need to detect the form of a block (name above e-mail in this case) and search for the same structure below each detected author.
Maybe I will also be able to detect blocks by comparing line heights. Something like: all lines with the same line height and roughly the same horizontal position are considered to be part of the same block.
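The line-height idea could look roughly like this. It is only a sketch in Python, not something the XMLImport class does yet; the `Line` record, its fields and the two tolerance values are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float       # horizontal position of the line's left edge
    y: float       # vertical position of the line
    height: float  # line height


def group_into_blocks(lines, height_tol=0.5, x_tol=5.0):
    """Group lines that share a line height and roughly the same x position."""
    blocks = []
    # Walk the lines from top to bottom so blocks keep their reading order.
    for line in sorted(lines, key=lambda ln: ln.y):
        for block in blocks:
            last = block[-1]
            same_height = abs(last.height - line.height) <= height_tol
            same_column = abs(last.x - line.x) <= x_tol
            if same_height and same_column:
                block.append(line)
                break
        else:
            # No existing block fits, so this line starts a new one.
            blocks.append([line])
    return blocks
```

Two authors standing next to each other, each with an e-mail address underneath, would end up in two separate blocks because their horizontal positions differ.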
(As a side note: I have also improved the title detection)
I’m currently writing an optimistic importer, based on my XML-Extraction binary, that makes some assumptions about the PDFs. I’m done with the importing of titles, and it is able to import the titles of most of the PDFs in the SLR folder correctly. I’ll leave it at that for now, until I have my tools for model debugging set up. Next up is the extraction of authors.
After I’m done with the optimistic import, I can look at some less convenient papers and see what I can do about them.
Today I was getting a little exhausted because my code didn’t feel clean at all. That’s why I took the time to create a UML diagram with a design that feels natural to me. I will now adjust my code to fit this design.
In addition, I named my project EggShell as its working title.
(Update: The examples all seem to work in the development version of Moose)
Certain examples from the learning materials on agilevisualization.com don’t work. I tried different versions of Moose and different Roassal packages. The first error I encountered was that the RTDSM class isn’t available. One recurring error is that trans isn’t understood.
Luckily I can correct most of the errors and still see the examples in action.
I annotate the learning sites with Scrible. I mark the parts that don’t work in the newest version of Moose (version 5.1). Up until now, the examples that didn’t work in version 5.1 also didn’t work in the version linked from agilevisualization.com (Moose 5), often for different reasons.
Well, this one was unexpected. The findTokens method splits up my strings at the set delimiters (namely new line). Two delimiters in a row count as one, as long as the quoteDelimiters argument is not set. So I had to set an arbitrary quoteDelimiter in order to prevent the empty lines of my text from being ignored.
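The difference between the two splitting behaviours is easy to show outside of Pharo. This Python sketch does not reproduce findTokens itself; it only contrasts "consecutive delimiters count as one" with "every delimiter counts", and both function names are made up for the example.

```python
import re


def split_collapsing(text: str) -> list[str]:
    """Treat a run of newlines as a single delimiter (empty lines vanish)."""
    return re.split(r"\n+", text)


def split_preserving(text: str) -> list[str]:
    """Treat every newline as a delimiter (empty lines are kept)."""
    return text.split("\n")
```

For a text with a blank line between two paragraphs, only the second variant keeps the empty line, which matters when the position of blank lines is used to find the title or author block.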
I am using the method pathString of the class FileReference to get the path for my terminal command. The returned string, however, is not escaped for use in the terminal. I am now writing a method that escapes a path string so it can be used in a terminal command.
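As a reference for what such an escaping method has to achieve, here is a small Python sketch using the standard library’s shlex.quote, which quotes a string for POSIX shells. It is not my Pharo method; `build_command` and the example paths are assumptions for illustration.

```python
import shlex


def build_command(binary: str, pdf_path: str) -> str:
    """Build a shell command line in which both paths are safely quoted."""
    # shlex.quote wraps the string in single quotes when needed and escapes
    # embedded single quotes, so spaces and quotes survive the shell.
    return f"{shlex.quote(binary)} {shlex.quote(pdf_path)}"
```

A path such as `/Users/me/My Papers/it's here.pdf` would otherwise be split at the space and break at the apostrophe. Note that this covers POSIX shells only; Windows command lines follow different quoting rules.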
In order to get the information from PDFs into Pharo, I had to find a simple cross-platform binary to turn them into text. There are fancy tools for extracting data from scientific papers (authors, citations etc.). Those depend on machine learning and are either huge in size (500MB and above) or inaccurate.
I still have to decide where I will keep this binary. I execute it using PipeableOSProcess (CommandShell somehow doesn’t recognize the binary). Neither OSProcess nor CommandShell seems to be able to access all binaries available in the terminal. (I installed pdftotext using Homebrew. Neither of the classes was able to use it.)
I now have the PDF available split up into lines. I will try to figure out what the names are as follows: “Check if the line below the name candidate contains an @ character or something like University or a name of a state/town/country”. I will have to test multiple PDFs in order to make sure this is accurate.
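Sketched in Python, the heuristic could look like this. It is a rough illustration and not tested against real papers; the hint words and function names are assumptions, and the check for state/town/country names is left out because it would need a list of place names.

```python
# Hypothetical hint words; a real list would have to grow with the test PDFs.
AFFILIATION_HINTS = ("university", "institute", "department")


def looks_like_contact_line(line: str) -> bool:
    """True if the line contains an @ or one of the affiliation hint words."""
    lowered = line.casefold()
    return "@" in line or any(hint in lowered for hint in AFFILIATION_HINTS)


def name_candidates(lines: list[str]) -> list[str]:
    """Return every non-empty line that is directly followed by a contact line."""
    return [
        lines[i]
        for i in range(len(lines) - 1)
        if lines[i].strip() and looks_like_contact_line(lines[i + 1])
    ]
```

The obvious weakness is that any line above an e-mail address counts as a name, so a second affiliation line sitting above an address would be picked up as well.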
One more thing: pdftotext apparently has some quirks. One PDF was extracted into text t h a t l o o k s l i k e t h i s. Given that PDF is heavily focused on layout and not information, I think I will be running into more problems like this.
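At least detecting such lines should be cheap. A minimal Python sketch, assuming that a line counts as letter-spaced when most of its whitespace-separated tokens are single characters; the function name and both parameters are made up for the example. It only flags the line and does not try to repair it, because the original word boundaries are lost.

```python
def looks_letter_spaced(line: str, min_tokens: int = 4, ratio: float = 0.8) -> bool:
    """Flag lines where most whitespace-separated tokens are single characters."""
    tokens = line.split()
    if len(tokens) < min_tokens:
        # Too short to judge; "a b c" could be a legitimate list of labels.
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= ratio
```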