Open Methodology

Using Colab, GitHub and Anaconda to properly document our entire data science process

Keith Monreal
7 min read · Sep 17, 2020

Proper documentation in analytics is central to Open Science, and it also helps reduce model risk. One of the three main domains of Open Science is access to scientific source code or syntax. When code and processes are transparent, teams can replicate existing projects and properly validate and revisit implemented models, which reduces the risk of relying on a wrong model.

Here is a step-by-step guide to using Google Colab, GitHub, and Jupyter and Spyder in Anaconda, from FTW Foundation TA Shanelle Recheta.

Introduction to Colab

Go to Colab and create a New Notebook. You will be redirected to a new, empty notebook.

You can rename your notebook by clicking its title at the top of the page.

Hover over a cell to add a Code or Text cell. A Code cell can be executed, while a Text cell is formatted using a simple markup language called Markdown. A guide can be found here. You can edit any public Colab notebook by saving a copy to your own Drive. To do this, click File > Save a copy in Drive.

You should be redirected to a new tab where you can edit and run your copy of the Colab notebook.

Move cells up and down using the arrow buttons. To run a cell, click on it and press Shift + Enter. You can also comment on a cell. To delete a cell, click on the Delete button.

Notebooks can be downloaded either as a Python file (.py) or as a Jupyter notebook (.ipynb). Saving as a Jupyter notebook is recommended for instructional purposes; saving as a Python file is recommended for production / DevOps purposes.
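To make the difference between the two formats concrete, here is a minimal sketch. It assumes the standard notebook layout, in which an .ipynb file is JSON with a top-level "cells" list whose entries carry a "cell_type" and a "source". The notebook content and the helper name notebook_to_script are made up for illustration, and Colab's own .py export may format things differently.

```python
import json

# A tiny, hypothetical notebook in the .ipynb JSON layout (nbformat 4).
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load the data"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 1 + 1\n", "print(x)"]},
    ],
})


def notebook_to_script(ipynb_text):
    """Flatten a notebook into plain Python: code stays, text becomes comments."""
    notebook = json.loads(ipynb_text)
    chunks = []
    for cell in notebook["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "code":
            chunks.append(source)
        elif cell["cell_type"] == "markdown":
            # Prefix every Markdown line with "# " so the script still runs.
            chunks.append("\n".join("# " + line for line in source.splitlines()))
    return "\n\n".join(chunks) + "\n"


script = notebook_to_script(notebook_json)
print(script)
```

The notebook keeps the explanatory text as its own cells, which is why it suits teaching; the flattened script keeps only runnable code plus comments, which is easier to schedule, import and review in production.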

Introduction to GitHub

Create an account in GitHub

Download GitHub for Windows. You may follow this YouTube tutorial

Create a new repository on your GitHub account

Input repository name and description.

Using the Command Prompt, clone your repository.
Press Windows + R, type cmd, then run git clone followed by the URL of your repository.

Copy the files you want to upload into the cloned repository folder.

Go into the directory that contains the project. To add the new files to the repository: git add . (the trailing dot stages everything in the current folder). To check the status of these files: git status. To commit the changes to the repository: git commit -m "your commit message".

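However, git commit may fail with an error asking you to identify yourself. If so, first set your GitHub email and username in the command prompt, for example with git config --global user.email "you@example.com" and git config --global user.name "Your Name", then run git commit again.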

Type git push -u origin master and log in with your GitHub credentials.
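If you repeat these steps often, they can be scripted. The sketch below is one possible way to do it from Python and is not part of the original walkthrough: the helper names publish_commands and run_all and the commit message are made up, and it assumes git is installed and the folder is already a clone with a remote named origin. It only prints the commands; nothing is run until you call run_all yourself.

```python
import shlex
import subprocess


def publish_commands(message, branch="master"):
    """Return the git commands from the steps above, in order."""
    return [
        ["git", "add", "."],                      # stage new and changed files
        ["git", "status"],                        # review what is staged
        ["git", "commit", "-m", message],         # record the change locally
        ["git", "push", "-u", "origin", branch],  # upload to GitHub
    ]


def run_all(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


# Preview only: print the commands instead of running them.
preview = [shlex.join(c) for c in publish_commands("Add notebook and dataset")]
print("\n".join(preview))
```

To actually run them, call run_all(publish_commands("Add notebook and dataset"), repo_dir) with repo_dir set to your own clone. If your repository's default branch is called main rather than master, pass branch="main".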

Refresh your GitHub repo on the website and check that all files were uploaded properly. You can also upload files directly to the GitHub repo using the Add file button.

Other Useful Tips

You can clone any public repository on GitHub. For example, search for “Data Science Cheat Sheets”, select “All GitHub”, and choose the repo you want to clone, fork or explore.

Once you have found the repo you are looking for, you can Watch or Star it, or:

  • Fork it (a fork is a copy of a repo under your own account; forking a repository allows you to freely experiment with changes without affecting the original project), or
  • Clone it (cloning creates a local copy of the remote repo; this allows you to make all of your edits locally rather than directly in the source files of the origin repo).

To clone a repo using the command line, go to the command prompt and type: git clone url-of-the-repo

Then, when you go to the directory where the clone was stored, you should be able to access the files. You can do this using the command prompt or using File Explorer. To check, you can type dir and all the contents of the cloned repository should be displayed.

  • In File Explorer, look for the folder where you cloned the repo. In this example, it’s C: > Users > Keith Monreal.

Introduction to Anaconda, Jupyter and Spyder

1. Jupyter

Launch Jupyter Notebook from the Anaconda Navigator home page.

In this exercise, you will open the repository you cloned earlier.

Open the folder of the cloned repo. For this exercise, make sure that the CSV file you want to load is in the same folder.

Create a new Python 3 notebook

Rename the notebook

You can add either a Code cell or a Text (Markdown) cell. Text cells are formatted using Markdown. To run a cell, press Shift + Enter or click Run. There are also options to run all cells, or all cells above or below the current one.

You should be able to see the generated csv file in the same folder after running df.to_csv(). You can check the generated csv file; all rows with missing values should have been removed.
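The notebook code for this exercise is not reproduced in the text, so here is a minimal sketch of what such a cleaning step could look like with pandas. The file names, column names and values are made up for illustration, and the sketch writes its own small input file in a temporary folder so it can run anywhere; in the exercise you would read the CSV that is already in the cloned repo folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Work in a temporary folder so the sketch runs anywhere; in the exercise this
# would be the cloned repo folder that already holds your CSV file.
folder = Path(tempfile.mkdtemp())

# Hypothetical input file with two incomplete rows.
(folder / "sample.csv").write_text(
    "name,age,city\n"
    "Ana,31,Manila\n"
    "Ben,,Cebu\n"
    "Cara,27,\n"
    "Dan,45,Davao\n"
)

df = pd.read_csv(folder / "sample.csv")

# Drop every row that has at least one missing value.
cleaned = df.dropna()

# index=False keeps the row index out of the saved file.
cleaned.to_csv(folder / "sample_cleaned.csv", index=False)

print(len(df), "rows before,", len(cleaned), "rows after")
```

Writing the result to a new file instead of overwriting the input keeps the raw data intact, which makes it easier for someone else to rerun and check the cleaning step.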

2. Spyder

To do the same in Spyder, launch Spyder from Anaconda Navigator.

Upon launch, you should see something like this. On the left side is the text editor where we will be writing our program. On the top right side, you can see the Variable Explorer where the variables we generate will later be displayed. On the bottom right side is the Console where output will be displayed.

However, unlike Jupyter Notebook, Spyder has no Markdown cells. If you would like to add a note or explanation to your code, write it as a comment by adding the ‘#’ sign at the start of the line. You can also highlight a line or snippet and press Ctrl + 1 to comment it out or uncomment it.
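As a small illustration of documenting a script with comments instead of Markdown, here is a sketch with made-up values. The "# %%" markers are an optional extra not covered above: Spyder treats them as cell separators, so each section can be run on its own, much like a notebook cell.

```python
# %% Prepare the data
# Hypothetical values standing in for a column of the dataset.
ages = [31, None, 27, 45]

# %% Clean the data
# Keep only the entries that are not missing.
valid_ages = [age for age in ages if age is not None]

# %% Summarise
# average_age will show up in the Variable Explorer after this runs.
average_age = sum(valid_ages) / len(valid_ages)
print(average_age)
```

Outside Spyder the markers are ordinary comments, so the same file still runs as a normal Python script.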

Save the file as a Python file (*.py) in the same directory where the dataset is saved.

You can double-click on a variable displayed in the Variable Explorer, and its contents will be displayed.

Shortcuts

You can save a copy of your Colab notebook directly to GitHub (without using the command prompt) by clicking on File > “Save a copy in GitHub” and choosing the repository where you want to store the copy.

Choose “master” as the branch by default. You can also type in a commit message.

Wait for the copy to finish being created. You should see the notebook in your repo when you refresh.
