I have decided to give something a try. I am going to write code to go through the first daily note file and parse it as best it can. But I will also keep track of the items it couldn’t successfully parse, like I did at the end of the last post. I will write the parsed and unparsed data to a file, something similar to the rain gauge files. That way I can hopefully use the previous code to extract and add the data to the database table(s) and csv file(s).

Once the file has been processed, I will go through the file manually and look for things that need fixing or eliminating. If that works reasonably well for the first file, bark_gdn_2022.php, I will do so for the remaining files including the current year-to-date.

So, let’s have a look at that file for 2022.

Refactor the Generator

Because I want to see all potentially incorrectly parsed notes, I am going to refactor the generator to return any note that contains any of the following: rain, snow, mm, cm. Hopefully that doesn’t overload the file. I was getting some problem rows, so I added a restriction: no rows containing “no rain last”.

def get_rg_data(rg_fl, f_typ="rg"):
  '''Generator for raingauge php files; only want certain lines, so decided to use a generator
     Parameters:
      rg_fl: full path to file
      f_typ: which file version is file
        'rg' = rain gauge style,
        'dn' = daily note style,
        'dna' = daily note style, but less restrictive than 'dn'
  '''
  with open(rg_fl, "r") as ds:
    c_dt = ""
    for ln in ds:
      # print(ndx)
      ln = ln.strip()
      do_yld = False
      if f_typ == "rg" and ln[:4] == "<br>" and ln[8] == "." and ln[11] == '.' and ln[14] != ":":
        do_yld = True
      if f_typ == "dn" or f_typ == "dna":
        if ln[:4] == "<h3>":
          c_dt = ln[4:14]
        # get items that may contain rain (mm) or snowfall (cm) info
        if f_typ == 'dn':
          if "no rain" not in ln and ("mm " in ln or "cm " in ln):
            do_yld = True
        elif f_typ == 'dna':
          if ("rain" in ln or "snow" in ln or "mm " in ln or "cm " in ln) and "no rain last" not in ln:
            do_yld = True
        if do_yld:
          ln = f"{c_dt} {ln}"
      if do_yld:
        yield ln
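
A quick way to see what the 'dna' test lets through is to pull the condition out into a small predicate and run it over a few lines. This is only a sketch: the helper name `is_dna_candidate` is hypothetical (it is not in the module), and the second and third sample lines are made up for illustration.

```python
def is_dna_candidate(ln: str) -> bool:
    """Hypothetical helper mirroring the 'dna' condition in get_rg_data."""
    has_kw = "rain" in ln or "snow" in ln or "mm " in ln or "cm " in ln
    return has_kw and "no rain last" not in ln


samples = [
    "<li>Rain to strt the day. YVR says light rain, 1.3&deg;C, wind ENE 15 km/h.</li>",
    "<li>Mostly sunny. YVR says clear, 4.0&deg;C.</li>",                 # made up
    "<li>08:00 &amp; no rain last 24 hours, 32.7mm for the month.</li>",  # made up
    "<li>Woke to a cm or so of snow covering the landscape.</li>",
]
flags = [is_dna_candidate(s) for s in samples]
for s, keep in zip(samples, flags):
    print(f"{'keep' if keep else 'skip'}: {s[:50]}")
```

Note that a line only needs one of the keywords to be kept, which is why so many forecast-style notes show up as "no match" further down.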

Process Test Month

Okay, here’s the test code I used for the above. Once again, just processing the month of February.

    if proc_fl_2:
      # process the data in the daily notes file month by month
      # add daily rainfall to database, record dates done
      p_mn, c_mn = "", ""               # prev mon, curr mon
      rf_rws, t_mn, m_dts = [], 0, {}   # rainfall rows, tot mon, mon dates
      c_src = 11

      # set up generator for combined data file
      gen_d = get_rg_data(d_srcs[c_src], f_typ="dna")

      c_mon = ""
      t_mon = 0
      s_mn = "02"
      e_mn = "02"
      not_mtch = False

      s_tm = time.perf_counter()
      for d_rw in gen_d:
        c_dt = d_rw[:10]
        c_ln = d_rw[11:]
        if c_mon != c_dt[5:7]:
          c_mon = c_dt[5:7]
          t_mon = 0
        if c_mon >= s_mn and c_mon <= e_mn:
          mtch = parse_rg_row(c_ln, f_typ="dn")
          if not mtch:
            s_nm = f"\nno match -> {c_dt}: {c_ln}" 
            print(s_nm)
            not_mtch = True
          else:
            d_tm, d_rf, d_unit = mtch.group(1), mtch.group(2), mtch.group(3)
            c_rf = float(d_rf)
            if d_unit == "mm":
              t_mon += c_rf
            else:
              t_mon += c_rf / 10
            s_mt = f"{'\n' if not_mtch else ''}{c_dt} {d_tm}: {c_rf} mm, {t_mon:.2f}"
            print(s_mt)
            not_mtch = False
        if c_mon > e_mn:
          break
      e_tm = time.perf_counter()
      print(f"proc time: {(e_tm - s_tm):.3f}")

And in the terminal I got the following. Sorry, it is a touch lengthy.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

no match -> 2022.02.01: <li>Looks to be mainly clear. YVR says mainly clear, 1.7&deg;C, wind NW 17 km/h. Possibility of snow in forecast for tomorrow.</li>

no match -> 2022.02.01: <li>Special weather statement on weather.gc.ca. Snowfall of 2-5 cm headed our way for tomorrow. Glad we can stay home.</li>

no match -> 2022.02.02: <li>17:11, so far no where near the forecast snowfall accumulation. May not even have gotten past 1 cm.</li>

no match -> 2022.02.03: <li>Rain to strt the day. YVR says light rain, 1.3&deg;C, wind ENE 15 km/h.</li>

2022.02.03 08:00: 6.8 mm, 6.80
2022.02.04 08:00: 12.5 mm, 19.30

no match -> 2022.02.05: <li>Cloudy start to the day, doesn't appear to be raining. YVR says mostly cloudy, 5.3&deg;C, wind SSW 11 km/h.</li>

2022.02.05 08:00: 4.3 mm, 23.60

no match -> 2022.02.07: <li>Believe it is a cloudy start to this Monday. YVR says mostly cloudy, 6.2&deg;C, wind E 16 km/h. Seems to have been a wee touch of rain overnight. Forecast has possibility of some showers for the early morning and some wind as well.</li>

2022.02.08 08:00: 4.5 mm, 28.10
2022.02.09 08:00: 4.3 mm, 32.40
2022.02.11 08:00: 0.3 mm, 32.70

no match -> 2022.02.14: <li>Appears cloudy. YVR says light rain, 6.6&deg;C, wind SE 9 km/h. Didn't look to be raining here, but it is pretty dark (just 04:44, up a bit early).</li>

no match -> 2022.02.14: <li>05:52, just noticed the rain e-gauge has recorded some precipitation. Guess that rain has arrived.</li>

2022.02.14 08:00: 6.4 mm, 39.10
2022.02.15 08:00: 1.2 mm, 40.30

no match -> 2022.02.17: <li>Cloudy morning, but not raining. YVR says mostly cloudy, 5.7&deg;C, wind E 8 km/h. Forecast is for drizzle mixed with rain.</li>

no match -> 2022.02.19: <li>Cloudy to start the day. YVR says light rain, 5.0&deg;C, wind ESE 12 km/h.</li>

2022.02.19 08:00: 1.8 mm, 42.10
2022.02.20 08:00: 7.1 mm, 49.20

no match -> 2022.02.21: <li>Looks to be cloudy (05:00). YVR says light rain, 2.6&deg;C, wind ENE 12 km/h. Rain gauge doesn't indicate rain currently falling here.</li>

2022.02.21 08:00: 0.5 mm, 49.70

no match -> 2022.02.24: <li>Woke to a cm or so of snow covering the landscape &#8212; natural and man-made. YVR says mainly clear, -3.5&deg;C, wind E 8 km/h (wind chill -7). Forecast says back to more normal temperatures and rain starting Saturday.</li>

no match -> 2022.02.25: <li>Mostly clear and cool to start the day. YVR: mainly clear, -2.8&deg;C, wind E 10 km/h, wind chill -7. Looks like a few days of rain starting tomorrow.</li>

2022.02.25 08:00: 0.8 mm, 50.50

no match -> 2022.02.27: <li>Cloudy, not currently raining. YVR: rain, 7.2&deg;;C, widn SSE 17 km/h. Rainfall warning in effect, the worst to the north and east of us it looks like..</li>

2022.02.27 08:00: 14.7 mm, 65.20

no match -> 2022.02.28: <li>Rain to start the day. 45 mm rain since Saturday morning. YVR: light drizzle, 7.0&deg;C, wind E 22 km/h. E-thermo on garage has us at 6&deg;C, but only 85% humidity versus 99% at YVR. Expect ours is wrong and likely temperature it is reporting is incorrect.</li>

2022.02.28 08:00: 30.7 mm, 95.90

no match -> 2022.02.28: <?php /* 97.0 - 5.4 = 91.6, 91.6 - 60.9 = 30.7 mm */ ?>
proc time: 0.038

Write to File

Okay, let’s refactor the above code to write the February data to a file.

... ...
      not_mtch = False
      c_mn_dn = []

      # path to file for writing rainfall data
      # F:\BaRKqgs\gdn\bark_gdn_2022.php
      cwd = Path.cwd()  # current working directory (same result as before)
      fl_pth = cwd/"data"
      fl_yr = d_srcs[c_src][-8:-4]
      fl_nm = f"dn_{fl_yr}_parse.txt"
      dn_pth = fl_pth/fl_nm
... ...
        if c_mon != c_dt[5:7]:
          c_mon = c_dt[5:7]
          t_mon = 0
          # make sure we are appending not overwriting the file
          with open(dn_pth, "a", encoding="utf-8") as dna:
            dna.writelines(c_mn_dn)
          c_mn_dn = []
        if c_mon >= s_mn and c_mon <= e_mn:
          mtch = parse_rg_row(c_ln, f_typ="dn")
          if not mtch:
            s_nm = f"\nno match -> {c_dt}: {c_ln}" 
            print(s_nm)
            c_mn_dn.append(f"{s_nm}\n")
            not_mtch = True
          else:
            d_tm, d_rf, d_unit = mtch.group(1), mtch.group(2), mtch.group(3)
            c_rf = float(d_rf)
            if d_unit == "mm":
              t_mon += c_rf
            else:
              t_mon += c_rf / 10
            s_mt = f"{'\n' if not_mtch else ''}{c_dt} {d_tm}: {c_rf} mm, {t_mon:.2f}"
            print(s_mt)
            # writelines() does not add line endings, so add one here
            c_mn_dn.append(f"{s_mt}\n")
            not_mtch = False
        if c_mon > e_mn:
          break
          # c_lp += 1
      if c_mn_dn:
        with open(dn_pth, "a", encoding="utf-8") as dna:
          dna.writelines(c_mn_dn)
      e_tm = time.perf_counter()

And I can assure you that the file data\dn_2022_parse.txt was created. And, its contents match those in the terminal display above.
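
One way to keep the append-then-reset steps together is a small helper that writes the buffered rows and empties the list in the same call. This is a sketch under assumptions: `flush_rows` is a hypothetical name, the file lives in a temporary directory, and the two rows are illustrative only.

```python
import tempfile
from pathlib import Path


def flush_rows(fl_pth: Path, rows: list[str]) -> None:
    """Append buffered rows to the file, then empty the buffer in place."""
    if rows:
        # "a" so earlier months already in the file are kept
        with open(fl_pth, "a", encoding="utf-8") as fh:
            fh.writelines(rows)
        rows.clear()


with tempfile.TemporaryDirectory() as tmp:
    dn_pth = Path(tmp) / "dn_demo_parse.txt"   # hypothetical file name
    buf = ["2022.02.03 08:00: 6.8 mm, 6.80\n"]
    flush_rows(dn_pth, buf)                    # e.g. on a month change
    buf.append("2022.03.01 08:00: 1.2 mm, 1.20\n")
    flush_rows(dn_pth, buf)                    # e.g. at the end of the run
    written = dn_pth.read_text(encoding="utf-8").splitlines()

print(written)
```

Because the list is cleared inside the helper, a row cannot be written twice by a later flush.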

Okay, I am going to modify the above to process the month of March.

And, the processed data was appended to the file. Though the calculated monthly rainfall does not agree with the spreadsheet. Will take a break and see if I can figure out why.

Fix February

When looking at the March notes, I saw that March 1st had the following note, which kind of explains the last note for February in the terminal output above.

  • 97.0 mm rain for February. 1.2mm so far in March. 6.6 24 hours. 5.4mm yesterday.

So, I added a note to February 28th, showing 5.4 mm at 24:00 (<li>24:00 &amp; 5.4mm rain since morning, 97.0mm for the month. Tube: ~?mm.</li>)

    Fix March

    March had the following daily note.

    2022.03.12

    • Cloudy, rain overnight. Doesn't appear to be raining at the moment.
    • 08:00 & 17.6mm/no rain last 24 hours, 45.2mm for the month. Tube: ~?mm.

    The note showing 17.6 mm of rain was being ignored because it contained the string “no rain last”. Fixed that.
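
The post does not show what the fix was, and it may simply have been an edit to the note itself. As one possible code-side approach, a minimal sketch: only skip a "no rain last" line when the phrase is not directly preceded by an amount such as "17.6mm/". The helper name `keep_dna_line` and the pattern are assumptions, and the second sample note is made up.

```python
import re

# an amount immediately followed by "/no rain last", e.g. "17.6mm/no rain last"
AMT_THEN_NO_RAIN = re.compile(r"\d+(?:\.\d+)?\s*mm\s*/\s*no rain last", re.IGNORECASE)


def keep_dna_line(ln: str) -> bool:
    """Hypothetical variant of the 'dna' test that keeps 'Xmm/no rain last' notes."""
    has_kw = "rain" in ln or "snow" in ln or "mm " in ln or "cm " in ln
    if not has_kw:
        return False
    if "no rain last" in ln and not AMT_THEN_NO_RAIN.search(ln):
        return False
    return True


mar_12 = "<li>08:00 &amp; 17.6mm/no rain last 24 hours, 45.2mm for the month. Tube: ~?mm.</li>"
dry_day = "<li>08:00 &amp; no rain last 24 hours, 27.6mm for the month.</li>"  # made up
print(keep_dna_line(mar_12), keep_dna_line(dry_day))
```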

    Fix January

    I used a spreadsheet to work out the values for January. The e-gauge was way behind the tube gauge (which is a lot less likely to be wrong). But I didn’t have tube readings for every day. So, I recorded what the e-gauge had for the days with recorded rainfall or equivalent, then scaled the values according to the tube readings I had: January 1-9 @ 84 mm, January 10-12 @ 105.5 mm (fell in 72 hours). The additional days were left without scaling. Then I put the results into the file, commenting out all the other rows for January.
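
The scaling itself was done in a spreadsheet, but the arithmetic is simple enough to sketch. Assuming the e-gauge values for a period are scaled by a single factor so that they sum to the tube total, it might look like the following. The function name and the e-gauge readings are made up; only the 84 mm tube total comes from the notes.

```python
def scale_to_total(daily: dict[str, float], tube_total: float) -> dict[str, float]:
    """Scale daily readings so they sum to tube_total, keeping their proportions."""
    raw_total = sum(daily.values())
    if raw_total == 0:
        raise ValueError("cannot scale: no rainfall recorded for the period")
    factor = tube_total / raw_total
    return {day: round(amt * factor, 1) for day, amt in daily.items()}


# made-up e-gauge readings for illustration; 84.0 mm is the January 1-9 tube total
e_gauge = {"2022.01.02": 10.0, "2022.01.04": 20.0, "2022.01.06": 12.0}
scaled = scale_to_total(e_gauge, 84.0)
print(scaled)
```

Rounding each day to one decimal can leave the sum a tenth or so off the tube total, so the last day may need a manual nudge.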

    Remaining Months

    I am going to follow this approach for the remaining months in 2022, one at a time, and see how things go. I will use the code to add the parsed and un-parsed notes to the file, then compare the monthly total to my spreadsheet. If there is a discrepancy, I will sort out how to resolve it, modify the source file accordingly, delete the current month from the file and regenerate. Repeat until I am happy with the parsed data.
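
Deleting a month from the interim file can be done by hand in an editor. If it gets tedious, a small filter along these lines could do it. This is a sketch only: `drop_month` is a hypothetical helper, and the row prefixes it recognises are assumed from the output shown in this post.

```python
import re

# rows start with the date, optionally after a known prefix
ROW_DATE = re.compile(r"^(?:no match -> |e-gauge: |tube: )?(\d{4})\.(\d{2})\.\d{2}\b")


def drop_month(lines: list[str], yr: str, mn: str) -> list[str]:
    """Return the lines with every row dated yr.mn removed."""
    kept: list[str] = []
    for ln in lines:
        m = ROW_DATE.match(ln)
        if m and (m.group(1), m.group(2)) == (yr, mn):
            continue
        if not ln.strip() and (not kept or not kept[-1].strip()):
            continue  # skip leading blanks and runs of blank lines left behind
        kept.append(ln)
    return kept


lines = [
    "2022.02.28 08:00: 30.7 mm, 95.90",
    "",
    "no match -> 2022.03.01: <li>some note</li>",
    "2022.03.02 08:00: 1.2 mm, 1.20",
    "",
    "no match -> 2022.04.01: <li>another note</li>",
]
trimmed = drop_month(lines, "2022", "03")
print(trimmed)
```

Reading the file with `splitlines()`, filtering, and writing the result back would complete the round trip.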

    E-gauge vs Tube Gauge

    I eventually found that the e-gauge periodically failed to record all the rainfall. The old tube gauge had no such problem. Well, except when frozen or during snowfalls. And eventually, sometime in 2024 I think, the e-gauge just failed to work properly. Didn’t feel like buying another one. I have continued to record its output, but I only use/accept rainfall amounts from the tube gauge.

    So when I started working on October 2023, I decided to track the tube values along with the e-gauge values. Would give me the option to use one or both.

    New Parsing Function

    Didn’t want to mess around with the current parsing function as it really only looks for rainfall amounts as recorded from daily e-gauge values. So wrote another one to look solely for tube rainfall amounts. Added the following to the rg_data module.

    Bit of overkill in the regex, but I was testing some things out and just left it that way. Not like it will be used at all frequently once I get these data files sorted and the rainfall amounts in the database.

    def parse_tb_row(d_rw, f_typ="rg"):
      """Parse supplied string to obtain the recorded time and rainfall/snowfall
         amount at the time recorded by the tube gauge.
    
        params:
          d_rw: the string to be parsed
        
        returns:
          result of appropriate regex match/search
      """
      # at least one note had sentence before the time and rainfall amount
      rgx = r".*?(\d{2}:\d{2}).*?tube: [\~\?]*?(\d+[\.\d]*) ?mm.*?</li>$"
      rx = re.compile(rgx, re.IGNORECASE)
      return rx.match(d_rw)
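
To see how the pattern behaves, here is a small standalone check using the same regex. The first note is made up in the style of the daily notes, the second is the March 12th note from above (with no tube value recorded), and the third has no reading at all.

```python
import re

# same pattern as in parse_tb_row
TUBE_RGX = re.compile(
    r".*?(\d{2}:\d{2}).*?tube: [\~\?]*?(\d+[\.\d]*) ?mm.*?</li>$",
    re.IGNORECASE,
)

notes = [
    "<li>08:00 &amp; 11.0mm rain last 24 hours, 11.0mm for the month. Tube: ~11 mm.</li>",
    "<li>08:00 &amp; 17.6mm/no rain last 24 hours, 45.2mm for the month. Tube: ~?mm.</li>",
    "<li>Cloudy, rain overnight.</li>",
]
results = []
for note in notes:
    m = TUBE_RGX.match(note)
    # group 1 is the time, group 2 the tube amount (as strings)
    results.append((m.group(1), m.group(2)) if m else None)
print(results)
```

Only the first note yields a time and amount. A tube entry of "~?mm" has no digits for the second group, so the match fails and the row is simply skipped.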
    

    Refactor Loop to Write Tube Data to File

    And, then I modified my file processing code to add the tube related data to the e-gauge data for each month. And, write that to the interim data file.

    ... ...
        if proc_fl_2:
    ... ...
          tb_mon = 0
    ... ...
            if c_mon != c_dt[5:7]:
              c_mon = c_dt[5:7]
              print(f"c_mon: {c_mon}")
              t_mon = 0
              tb_mon = 0
    ... ...
                s_mt = f"{'\n' if not_mtch else ''}e-gauge: {c_dt} {e_tm}: {d_rf} mm, {t_mon:.2f}"
    ... ...
            if c_mon >= s_mn and c_mon <= e_mn:
              mtch = parse_rg_row(c_ln, f_typ="dn")
              tb_mtch = parse_tb_row(c_ln)
    ... ...
              if tb_mtch:
                t_tm, t_rf = tb_mtch.group(1), tb_mtch.group(2)
                tb_mon += float(t_rf)
                c_mn_dn.append(f"tube: {c_dt} {t_tm}: {t_rf} mm, {tb_mon:.2f}\n")
    

    And, here’s what got written to file for October 2023.

    e-gauge: 2023.10.03 08:00: 11.0 mm, 11.00
    tube: 2023.10.03 08:00: 11 mm, 11.00
    e-gauge: 2023.10.10 08:00: 10.0 mm, 21.00
    tube: 2023.10.10 08:00: 10 mm, 21.00
    e-gauge: 2023.10.11 08:00: 5.0 mm, 26.00
    tube: 2023.10.11 08:00: 5 mm, 26.00
    e-gauge: 2023.10.14 08:00: 9.5 mm, 35.50
    tube: 2023.10.14 08:00: 9.5 mm, 35.50
    e-gauge: 2023.10.15 08:00: 1.5 mm, 37.00
    tube: 2023.10.15 08:00: 1.5 mm, 37.00
    e-gauge: 2023.10.16 08:00: 3.5 mm, 40.50
    tube: 2023.10.16 08:00: 3.5 mm, 40.50
    e-gauge: 2023.10.17 08:00: 19.5 mm, 60.00
    tube: 2023.10.17 08:00: 19.5 mm, 60.00
    e-gauge: 2023.10.18 08:00: 48.0 mm, 108.00
    tube: 2023.10.18 08:00: 48 mm, 108.00
    e-gauge: 2023.10.19 08:00: 24.0 mm, 132.00
    tube: 2023.10.19 08:00: 24 mm, 132.00
    e-gauge: 2023.10.20 08:00: 7.0 mm, 139.00
    tube: 2023.10.20 08:00: 7.0 mm, 139.00
    e-gauge: 2023.10.25 08:00: 17.5 mm, 156.50
    tube: 2023.10.25 08:00: 17.5 mm, 156.50
    

    I was actually surprised they both matched. Months earlier in 2023 did not. I determined that manually. And actually manually modified data so that the resulting monthly total matched the tube gauge data. As far as I can tell, October 2023 was the first month that year that I recorded the data from both gauges each day. We will see what November looks like.
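
The comparison between the two gauges was done by eye, but with both kinds of row in the interim file it could also be checked in code. A minimal sketch, assuming the row format shown above; `gauge_mismatches` is a hypothetical helper and the two November rows are made up to show a mismatch.

```python
import re
from collections import defaultdict

ROW = re.compile(r"^(e-gauge|tube): (\d{4}\.\d{2}\.\d{2}) \d{2}:\d{2}: ([\d.]+) mm, ")


def gauge_mismatches(rows: list[str], tol: float = 0.05) -> list[tuple[str, float, float]]:
    """Return (date, e_gauge_mm, tube_mm) for days where the two gauges differ."""
    by_day: dict[str, dict[str, float]] = defaultdict(dict)
    for rw in rows:
        m = ROW.match(rw)
        if m:
            # a second reading for the same gauge and day overwrites the first
            by_day[m.group(2)][m.group(1)] = float(m.group(3))
    diffs = []
    for day, vals in sorted(by_day.items()):
        if "e-gauge" in vals and "tube" in vals:
            if abs(vals["e-gauge"] - vals["tube"]) > tol:
                diffs.append((day, vals["e-gauge"], vals["tube"]))
    return diffs


october = [
    "e-gauge: 2023.10.03 08:00: 11.0 mm, 11.00",
    "tube: 2023.10.03 08:00: 11 mm, 11.00",
    "e-gauge: 2023.10.14 08:00: 9.5 mm, 35.50",
    "tube: 2023.10.14 08:00: 9.5 mm, 35.50",
]
made_up = [
    "e-gauge: 2023.11.01 08:00: 4.0 mm, 4.00",
    "tube: 2023.11.01 08:00: 6.5 mm, 6.50",
]
print(gauge_mismatches(october))
print(gauge_mismatches(october + made_up))
```

Days with only one of the two readings are ignored here; listing those separately would be an easy extension.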

    For both November and December there were days when the tube data was not recorded. I added some estimated daily values (proportional to the e-gauge values). For both months the e-gauge reported significantly less rainfall than the tube gauge: ~145 mm difference for December.

    Took me around 5½ hours to process the 2023 notes file.

    Done

    I think I’m calling this post quits. Will continue parsing the remaining daily note files. If anything really strange happens will add to the post.

    Until next time, going with the flow seems to help in many situations.

    Afterwords

    I hadn’t been opening the daily note files as UTF-8. Had to fix that while working on November 2024.
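
For reference, without an explicit `encoding` argument Python's `open()` falls back to the locale's preferred encoding, which on Windows is often not UTF-8. Passing `encoding="utf-8"` when reading is one way to handle that. The sketch below uses a throwaway file with a made-up note line.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    fl = Path(tmp) / "note_demo.php"   # hypothetical file name
    # write a line containing a non-ASCII character (the degree sign)
    fl.write_text("<li>Cloudy, 5.0\u00b0C</li>\n", encoding="utf-8")
    # read it back the way the generator would, but with an explicit encoding
    with open(fl, "r", encoding="utf-8") as ds:
        ln = ds.readline().strip()

print(ln)
```

In the generator this would amount to adding the same argument to its `open(rg_fl, "r")` call.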

    Only took around 2 hours to parse the 2024 note file into the intermediate format. The bulk of the work was in fact done by the code. Though what was missed usually took more time than I would have liked to find and correct.