Open Methodology
Using Colab, GitHub and Anaconda to properly document our entire data science process
Proper documentation in analytics is central to the concept of Open Science, and it can also help decrease model risk. One of the three main domains of Open Science is the accessibility of scientific source code or syntax. With transparency in the code and the processes performed, teams can replicate existing projects and properly validate and revisit implemented models, reducing the risk of using the wrong models.
Here is a step-by-step guide to using Google Colab, GitHub, and Jupyter and Spyder in Anaconda, from FTW Foundation TA Shanelle Recheta.
Introduction to Colab
Go to Colab and create a New Notebook. You will be redirected to a page similar to this.
You can rename the title of your notebook here:
Hover over a cell to add a Code or Text cell. A Code cell can be executed, while a Text cell is formatted using a simple markup language called Markdown. A guide can be found here. You can edit any public Colab notebook by saving a copy to your own Drive. To do this, click on File > Save a copy in Drive.
You should be redirected to a new tab where you can edit and run your copy of the Colab notebook.
Move cells up and down using the up and down arrows. To run a cell, click on it and press Shift + Enter. You can also comment on a cell. To delete a cell, click on the Delete button.
Notebooks can be downloaded either as a Python file (.py) or as a Jupyter notebook (.ipynb). Saving as a Jupyter notebook is recommended for instructional purposes; saving as a Python file is recommended for production / DevOps purposes.
Introduction to GitHub
Create an account on GitHub.
Download GitHub for Windows. You may follow this YouTube tutorial
Create a new repository on your GitHub account.
Input repository name and description.
Using the command prompt, clone your repository.
Press Windows + R, type cmd, and press Enter. Then type git clone followed by the URL of your repository.
Copy the files you want to upload into the cloned repository folder.
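Go into the directory that contains the project. To add the new files to the repository: git add . (the dot, separated by a space, stages all new and changed files in the current directory). To check the status of these files: git status. To commit the changes to the repository: git commit -m "your commit message"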
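However, git commit may fail with an error (the highlighted red box), typically because Git does not yet know who you are. In that case, first specify your GitHub email and username in the command prompt by running git config --global user.email "you@example.com" and git config --global user.name "Your Name" (placeholder values; substitute your own), then run git commit again.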
Type git push -u origin master and log in with your GitHub credentials.
Refresh your GitHub repo on the website and check that all files were uploaded properly. You can also upload files directly to the GitHub repo using the Add file button.
Other Useful Tips
You can clone any public repository on the web. For example, you can search for Data Science Cheat Sheets, select “All GitHub”, and choose the repo you want to clone / fork / explore…
Once you have found the repo you are looking for, you can Watch or Star it, or:
- Fork it (a fork is a copy of a repo under your own account; forking a repository allows you to freely experiment with changes without affecting the original project), or
- Clone it (cloning creates a local copy of the remote repo on your computer; this allows you to make all of your edits locally rather than directly in the source files of the origin repo).
To clone a repo using the command line, go to the command prompt and type: git clone url-of-the-repo
Then, when you go to the directory where the clone was stored, you should be able to access the files. You can do this using the command prompt or using File Explorer. To check, you can type dir and all the contents of the cloned repository should be displayed.
- In File Explorer, look for the folder where you cloned the repo. In this example, it’s C: > Users > Keith Monreal
Introduction to Anaconda, Jupyter and Spyder
1. Jupyter
Launch Jupyter Notebook from the Anaconda Navigator home page.
In this exercise, you will open the cloned repository from before, sample:
Open the folder of the cloned repo. For this exercise, make sure that the CSV file you want to load is in the same folder.
Create a new Python 3 notebook.
Rename the notebook
You can add either a Code cell or a Text cell. Text cells can be formatted using Markdown. To run a cell, press Shift + Enter or click Run. There are also options to Run All cells or to run the cells below / above.
You should be able to see the generated csv file in the same folder after running df.to_csv(). You can check the generated csv file; all rows with missing values should have been removed.
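The exercise's code is not reproduced in the text, so the following is only a minimal sketch of what the steps above might look like, assuming pandas is installed (it is included in the Anaconda distribution). The file names and the sample data are hypothetical stand-ins for the CSV file in your cloned repo; the sketch writes them to a temporary folder so it can run on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file names and sample data; replace these with the CSV
# file that sits in your cloned repository folder.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "sample.csv")
clean_path = os.path.join(workdir, "sample_clean.csv")

# Create a small CSV in which one row has a missing score.
with open(raw_path, "w") as f:
    f.write("name,score\nAna,90\nBen,\nCara,85\n")

# Load the CSV file into a DataFrame.
df = pd.read_csv(raw_path)

# Drop every row that contains at least one missing value.
cleaned = df.dropna()

# Write the cleaned data to a new CSV file in the same folder.
cleaned.to_csv(clean_path, index=False)

print(pd.read_csv(clean_path))
```

Passing index=False keeps the row index out of the generated file, so the output has the same columns as the input.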
2. Spyder
To do the same exercise in Spyder, launch Spyder from Anaconda Navigator.
Upon launch, you should see something like this. On the left side is the text editor where we will be writing our program. On the top right side, you can see the Variable Explorer where the variables we generate will later be displayed. On the bottom right side is the Console where output will be displayed.
However, unlike Jupyter Notebook, Spyder has no Markdown cells. If you would like to add a note or explanation for your code, write it as a comment by adding the ‘#’ sign at the start of the line. You can also highlight a line or snippet and press Ctrl + 1 to comment it out.
Save the file as a Python file (*.py) in the same directory where the dataset is saved.
You can double-click on a variable displayed in the Variable Explorer, and its contents will be displayed.
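As a small illustration of comments in a Spyder script, here is a sketch in which the variable names and sample values are hypothetical and serve only to show where the ‘#’ sign goes.

```python
# A full-line comment: Python ignores everything after the '#' sign.
values = [3, None, 7, None, 10]  # hypothetical sample data; None marks a missing value

# Keep only the entries that are not missing.
present = [v for v in values if v is not None]

# The next line is "commented out", so Python skips it when the script runs.
# print(values)

print(len(present))
```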
Shortcuts
You can save a copy of your Colab notebook directly to GitHub (without using the command prompt) by clicking on “Save a copy in GitHub” and choosing the repository where you want to store the copy.
Choose “master” as the branch (the default). You can also type in a commit message.
Wait for the copy to finish being created. You should see the notebook in your repo when you refresh.