Sunday, June 02, 2013

Manual Copying of Publicly Available Information on the Web with Intelligent Aggregation - Can This Be Worthwhile Research?

Let me preface what I have to say by noting that when I was doing economic research it was all highly theoretical - math models and analysis thereof.  The models may have been motivated by the research of somebody else, with that work having an empirical basis.  But I simply trusted those results.  I didn't look at other data to validate the results on my own.  So at heart I'm a theorist and come to questions with a theorist's sensibility.

Nonetheless, I developed something of an empirical approach when I became a campus administrator, at least from time to time. It started straight away with the evaluation of the SCALE project. I became directly involved with the faculty component, along with Cheryl Bullock, who was working on the evaluation.  We had a few core questions we wanted each interviewee to answer for the sake of the evaluation.  But I also wanted to get to know these people, to develop my own personal network, and to learn about their motives and uses of technology in their teaching.  For that a more open-ended kind of discussion was appropriate, and we mixed and matched on those.  It worked out pretty well.

I did a different sort of project that is more in line with the title of this post.  We had been making a claim that ALN (what we called it then; you can call it eLearning now) would improve retention in classes.  Yet nobody had bothered, in advance, to find out what retention looks like at Illinois on a class-by-class basis.  So I got the folks who deal with institutional data to get me a few years of data with 10-day enrollments and final enrollments for every class offered on campus.  This wasn't highfalutin research.  I plopped the data into Excel and had it compute the ratio of the final number to the 10-day number, represented as a percentage.  There may have been adds and drops in between.  For my purposes these were inessential, so I ignored them.  Those weren't easy to track anyway; for starters, let's look at what could readily be tracked.  So I called my ratios the retention rates.  The finding was that by and large those were so high - often above 95% - that it would be impossible for ALN to improve them much if at all, since they were bounded above by 100%.  At schools where those ratios were lower, looking at how ALN might improve things was sensible.  It simply wasn't an interesting question at Illinois at the time.  We didn't concern ourselves with it further in the evaluation.
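The spreadsheet arithmetic is simple enough to sketch in a few lines of Python.  The column names and enrollment figures below are hypothetical, not from the actual campus data; the computation is the same ratio described above.

```python
def retention_rates(rows):
    """Ratio of final enrollment to 10-day enrollment, as a percentage.

    Each row is a dict with assumed keys 'class', 'ten_day', 'final'.
    Adds and drops in between are ignored, as in the original exercise.
    """
    rates = {}
    for row in rows:
        ten_day = int(row["ten_day"])
        if ten_day > 0:  # skip empty sections to avoid dividing by zero
            rates[row["class"]] = 100.0 * int(row["final"]) / ten_day
    return rates

# Made-up enrollment numbers for illustration:
sample = [
    {"class": "ECON 101", "ten_day": "200", "final": "192"},
    {"class": "ACCY 301", "ten_day": "50", "final": "49"},
]
print(retention_rates(sample))  # {'ECON 101': 96.0, 'ACCY 301': 98.0}
```

With ratios routinely above 95%, the headroom for improvement is at most a few percentage points - which was the whole point of the exercise.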

Fast forward ten years or so.  At the time I was a new Associate Dean in the College of Business and had just returned from a trip to Penn State's Smeal College of Business for a meeting with my counterparts from various institutions.  They had a brand new building and were very proud of how many of their classes would meet in the new building instead of elsewhere around campus.  I learned that they had procured scheduling software within the College to pre-schedule classes, with a major effort to have College classes meet in the new building.  Then, when dealing with the University, which did the official scheduling, they already had a plan that featured high utilization in the new building.  Since we were opening a new building a year later, I got it in my head to imitate what Smeal had already been doing.

But I needed to convince myself so I could then convince others in the College that such a purchase of scheduling software would be worth it.  To do that, I needed to do several things.  One of those was to understand how we were doing with scheduling at present.  Conceivably, one could have tried to run some reports out of Banner for this purpose.  However, I didn't know how to do that and I was unsure whether it would give me the information I needed.  So I did something much more primitive, but where I was surer it would give me the results I wanted.

I went to the online Course Schedule (this was for the fall 2007 semester).  The data are arranged first by rubric, then by course number, then by section.  It gives the time of the class meeting and the room.  The College of Business has three departments - Accountancy, Business Administration, and Finance - and I believe 5 rubrics.  (MBA is a separate rubric and the College has a rubric BUS.)  So I gathered this information, section by section, and put it into Excel.

Now a quick aside for those who don't know about class scheduling.  Most classrooms are "General Assignment," which means that conceivably any class on campus might be scheduled into the particular room.  There are some rooms that are controlled by a unit.  The Campus doesn't schedule those rooms, and I won't spend much time on the ones the College of Business controlled in what follows.  It is well understood that a faculty member would prefer to teach in the same building where he or she has an office, all else equal, and it is probably better for students in a major to take classes in the building where the department is housed.

To address this, the Campus gives "priority" for scheduling a particular classroom to a certain department.  Before such and such a date, only that department can schedule classes into the room.  After that date the room becomes generally available for classes not in the department.  There is no department with second priority - one that gets the room after the first department has done its work.  Timing-wise, that just wouldn't work.  This is largely the reason why a College pre-scheduling within its own building can improve on building utilization.  It can get started earlier and negotiate allocations across departments.

Ahead of time I did not know which classrooms our departments had priority for.  I let the data tell me.  So, after collecting the data as I described above, I needed to invert it to look at it from a room basis rather than from a rubric-course-number basis.  Excel's sort function is quite good for doing this - provided you enter the data in a way where it can be readily sorted.  There is some intelligence, therefore, in anticipating the sorting at the time you are entering the data and coming up with a useful scheme for doing so.  What I came up with managed these issues.  One of my questions in the title is whether for other such efforts equally useful ways of collecting the data are found.  I had some definite things I wanted to learn from the effort, so the schema arose from that.  My guess, however, is that sometimes the questions will only become apparent after looking at the data for quite a while.  So an issue is whether the schema adopted in the interim can give way to a more suitable schema for addressing the questions.
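The same inversion can be sketched outside of Excel.  Here is a minimal Python version; the section records below are made up for illustration (the rubrics are real, but the times and rooms are invented), and the field layout is an assumption about how one might enter the data.

```python
from collections import defaultdict

# Hypothetical section records, entered rubric-by-rubric as in the
# original exercise: (rubric, course number, section, meeting time, room).
sections = [
    ("ACCY", "301", "A", "MW 08:00-09:50", "Room 1"),
    ("FIN",  "300", "B", "TR 09:30-10:50", "Room 1"),
    ("BADM", "310", "C", "TR 09:30-10:50", "Room 2"),
]

# Invert: group by room rather than by rubric/course/section,
# then sort each room's sections by meeting time.
by_room = defaultdict(list)
for rubric, number, section, time, room in sections:
    by_room[room].append((time, rubric, number, section))

for room in sorted(by_room):
    for time, rubric, number, section in sorted(by_room[room]):
        print(room, time, rubric, number, section)
```

The room view is what makes the priority pattern visible: the rooms in which a department's sections cluster are the ones it likely has priority for.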

I learned a variety of little facts from going through this exercise and in that way became an expert on College class scheduling - nobody else in the College had this expertise then, because nobody had looked at the information in this way.  Soon thereafter, however, the real experts - the department schedulers - took over and became much better informed than I was.  Here's a review of some of the things I learned - most obvious in retrospect but not ahead of time.  A couple of things I thought more profound.

1.  The vast majority of classes met two days a week.  That's how it was for me in graduate school, when courses were four credit hours and met in two different two-hour blocks.  As an undergraduate I had mainly three-credit-hour classes, which met three times a week in one-hour blocks.  That three-times-a-week model seems to have gone by the wayside except in the high enrollment classes, where the third meeting time is in discussion section.

2.  Business Administration and Finance classes were three hours per week, both at the undergraduate level and the graduate level.  Accountancy classes were four hours per week, undergraduate and graduate (and the high enrollment classes had four hours in lecture and then a discussion section or lab).  I think the "why" question for this outcome is itself quite interesting and worthy of further analysis, but I won't concern myself with it here.  For the purposes of scheduling alone, it produced an important consequence.  Accountancy was on a de facto grid.  Classes started at 8 AM or 10 AM, but never at 9 AM.  That carried through the rest of the day.  With the other departments there was no standard starting time.  A class might start at 9 AM or at 9:30 AM or at 10 AM.

3.  There were some classes that met for 3 hours straight, once a week.  Presumably these emerged via faculty request, so faculty could consolidate their teaching to one day a week.

4.  Accountancy had classes in the evening.  The other departments did not.   Nighttime classes are not preferred.  So this showed a space crunch for Accountancy.

5.  The new building was mainly an improvement in instructional space, not that much of an increase in overall classroom capacity for the College.  (Priority assignment in the Armory, DKH, and some of Wohlers would have to be given up after the new building came into operation.)

6.  One classroom is not a perfect substitute for another.  In addition to number of seats, some classrooms were flat, where furniture could be more easily arranged on the fly, and other classrooms were tiered, where the writing surface was fixed.  Accountancy had a distinct preference for flat classrooms.  The other departments preferred tiered classrooms.  That was more of a differentiator across classrooms than technology in the room or other factors. 

Not all the information I needed arose from this data collection effort.  Another thing I learned from meeting with Campus Scheduling in the Registrar's Office was that the Campus faced limits with their scheduling because, in essence, we were too big for the scheduling software that was being utilized, run by the University (so supporting Chicago and Springfield as well as Urbana).  They therefore applauded these efforts in pre-scheduling and wanted to help us rather than view this approach as competition to what they did.

I also learned that each of the departments had a shadow scheduling system, idiosyncratic to the department, for doing the scheduling work.  And I learned further that the department schedulers were very nice people who were happy to collaborate with one another, but they hadn't done so in the past because there wasn't a perceived need then, and nobody took it upon herself to coordinate the effort.  One of the big benefits of procuring the scheduling software and having the vendors come to visit us to showcase what they had was that it solidified the collaboration among the department schedulers.


I have belabored the lessons learned above to show what other research efforts of this sort might produce.  The approach is somewhat different from how economics is normally done - where a model is constructed prior to the data collection, hypotheses are generated from the model, and then the data are used principally to test the hypotheses.  Economics has theory in the predominant position and data in a subordinate role.

What I learned from my Administrator work, and particularly from some early conversations with Chip Bruce, is that with evaluation efforts, in particular, it is often better to be an anthropologist and consider the data collection as mainly being descriptive about what is going on.  There is learning in that.  We are often far too impatient to try to explain what we observe and render judgment on it, thumbs up or thumbs down.  That is less interesting, especially when it is done prematurely, and it will tend to block things we might learn through the descriptive approach.

There is a related issue, one I think most people don't have a good handle on, which is that much real learning is by serendipity.  One doesn't really know what will be found until one looks.  Thus the looking itself becomes an interesting activity, even if it's just for the hell of it.  Of course, dead ends are possible.  One might desire to rule them out in advance.  I don't think that is possible.  That might explain why getting started on data collection is difficult.  In the old SCALE days we embraced the Nike motto - Just Do It.  That was for teaching with ALN.  The idea there was to learn by doing instead of doing a lot of course development up front.  It meant that an iterative approach had to be employed.  I believe more or less the same thing applies to data collection of this sort.  

I now have a couple of undergraduate students doing a summer research project under me that might extend into the fall, if they make headway now.  These students took a class from me last fall, did reasonably well in that course, and were looking to do something further along those lines.

They don't have my prior experience with data collection.  I wonder if they can get to something like my sense of understanding how to do these things from the project they are working on.  We'll see. 
